Preface



From the Preface to the First Edition 

When writing a graduate level mathematics book during the last decade of 
the twentieth century, one probably ought not inquire too closely into one’s 
motivation. In fact, if one’s own pleasure from the exercise is not sufficient to
justify the effort, then one should seriously consider dropping the project. Thus, 
to those who (either before or shortly after opening it) ask for whom was this
book written, my pale answer is me; and, for this reason, I thought that I should
preface this preface with an explanation of who I am and what were the peculiar 
educational circumstances that eventually gave rise to this somewhat peculiar 
book. 

My own introduction to probability theory began with a private lecture from 
H.P. McKean, Jr. At the time, I was a (more accurately, the) graduate student 
of mathematics at what was then called The Rockefeller Institute for Biologi- 
cal Sciences. My official mentor there was M. Kac, whom I had cajoled into 
becoming my adviser after a year during which I had failed to insert even one 
micro-electrode into the optic nerves of innumerable limuli. However, as I soon 
came to realize, Kac had accepted his role on the condition that it would not 
become a burden. In particular, he had no intention of wasting much of his 
own time on a reject from the neurophysiology department. On the other hand, 
he was most generous with the time of his younger associates, and that is how 
I wound up in McKean’s office. Never one to bore his listeners with a lot of 
dull preliminaries, McKean launched right into a wonderfully lucid explanation 
of P. Levy’s interpretation of the infinitely divisible laws. I have to admit that 
my appreciation of the lucidity of his lecture arrived nearly a decade after its 
delivery, and I can only hope that my reader will reserve judgment of my own 
presentation for an equal length of time. 

In spite of my perplexed state at the end of McKean’s lecture, I was sufficiently 
intrigued to delve into the readings that he suggested at its conclusion. Knowing 
that the only formal mathematics courses that I would be taking during my 
graduate studies would be given at N.Y.U. and guessing that those courses would 
be oriented toward partial differential equations, McKean directed me to material 
which would help me understand the connections between partial differential 
equations and probability theory. In particular, he suggested that I start with 
the, then recently translated, two articles by E.B. Dynkin which had appeared 
originally in the famous 1956 volume of Teoriya Veroyatnostei i ee Primeneniya. 
Dynkin’s articles turned out to be a godsend. They were beautifully crafted to 
tell the reader enough so that he could understand the ideas and not so much
that he would become bored by them. In addition, they gave me an introduction 
to a host of ideas and techniques (e.g., stopping times and the strong Markov 
property), all of which Kac himself consigned to the category of overelaborated 
measure theory. In fact, it would be reasonable to say that my thesis was simply 
the application of techniques which I picked up from Dynkin to a problem that 
I picked up by reading some notes by Kac. Of course, along the way I profited 
immeasurably from continued contact with McKean, a large number of courses 
at N.Y.U. (particularly ones taught by M. Donsker, F. John, and L. Nirenberg), 
and my increasingly animated conversations with S.R.S. Varadhan. 

As I trust the preceding description makes clear, my graduate education was 
anything but deprived; I had ready access to some of the very best analysts 
of the day. On the other hand, I never had a proper introduction to my field,
probability theory. The first time that I ever summed independent random 
variables was when I was summing them in front of a class at N.Y.U. Thus, 
although I now admire the magnificent body of mathematics created by A.N. 
Kolmogorov, P. Levy, and the other twentieth-century heroes of the field, I
am not a dyed-in-the-wool probabilist (i.e., what Donsker would have called a 
true coin-tosser). In particular, I have never been able to develop sufficient 
sensitivity to the distinction between a proof and a probabilistic proof. To me, 
a proof is clearly probabilistic only if its punch-line comes down to an argument 
like $P(A) \le P(B)$ because $A \subseteq B$; and there are breathtaking examples of such
arguments. However, to base an entire book on these examples would require a 
level of genius that I do not possess. In fact, I myself enjoy probability theory 
best when it is inextricably interwoven with other branches of mathematics and 
not when it is presented as an entity unto itself. For this reason, the reader 
should not be surprised to discover that he finds some of the material presented 
in this book does not belong here; but I hope that he will make an effort to figure
out why I disagree with him. 

Preface to the Second Edition 

My favorite “preface to a second edition” is the one that G.N. Watson wrote for 
the second edition of his famous treatise on Bessel functions. The first edition 
appeared in 1922, the second came out in 1941, and Watson had originally 
intended to stay abreast of developments and report on them in the second 
edition. However, in his preface to the second edition Watson admits that his 
interest in the topic had “waned” during the intervening years and apologizes 
that, as a consequence, the new edition contains less new material than he had 
thought it would. 

My excuse for not incorporating more new material into this second edition is 
related to but somewhat different from Watson’s. In my case, what has waned 
is not my interest in probability theory but instead my ability to assimilate 
the transformations that the subject has undergone. When I was a student, 
probabilists were still working out the ramifications of Kolmogorov’s profound
insights into the connections between probability and analysis, and I have spent 
my career investigating and exploiting those connections. However, about the 
time when the first edition of this book was published, probability theory began 
a return to its origins in combinatorics, a topic in which my abilities are woefully 
deficient. Thus, although I suspect that, for at least a decade, the most exciting 
developments in the field will have a strong combinatorial component, I have 
not attempted to prepare my readers for those developments. I repeat that my 
decision not to incorporate more combinatorics into this new edition in no way 
reflects my assessment of the direction in which probability is likely to go but 
instead reflects my assessment of my own inability to do justice to the beautiful 
combinatorial ideas that have been introduced in the recent past. 

In spite of the preceding admission, I believe that the material in this book 
remains valuable and that, no matter how probability theory evolves, the ideas 
and techniques presented here will play an important role. Furthermore, I have 
made some substantive changes. In particular, I have given more space to in- 
finitely divisible laws and their associated Levy processes, both of which are now 
developed in $\mathbb{R}^N$ rather than just in $\mathbb{R}$. In addition, I have added an entire
chapter devoted to Gaussian measures in infinite dimensions from the perspective of
the Segal-Gross school. Not only have recent developments in Malliavin calculus 
and conformal field theory sparked renewed interest in this topic, but it seems to 
me that most modern texts pay either no or too little attention to this beautiful 
material. Missing from the new edition is the treatment of singular integrals. I 
included it in the first edition in the hope that it would elucidate the similarity 
between cancellations that underlie martingale theory, especially Burkholder’s 
Inequality, and Calderon-Zygmund theory. I still believe that these similarities 
are worth thinking about, but I have decided that my explanation of them led 
me too far astray and was more of a distraction than a pedagogically valuable 
addition. 

Besides those mentioned above, minor changes have been made throughout. 
For one thing, I have spent a lot of time correcting old errors and, undoubtedly, 
inserting new ones. Secondly, I have made several organizational changes as well 
as others that are remedial. A summary of the contents follows. 



Summary 

1: Chapter 1 contains a sampling of the standard, pointwise convergence theo- 
rems dealing with partial sums of independent random variables. These include 
the Weak and Strong Laws of Large Numbers as well as Hartman-Wintner’s Law 
of the Iterated Logarithm. In preparation for the Law of the Iterated Logarithm, 
Cramer’s theory of large deviations from the Law of Large Numbers is developed 
in § 1.4. Everything here is very standard, although I feel that my passage from 
the bounded to the general case of the Law of the Iterated Logarithm has been 
considerably smoothed by the ideas that I learned during a conversation with
M. Ledoux. 

2 : The whole of Chapter 2 is devoted to the classical Central Limit Theorem. 
After an initial (and slightly flawed) derivation of the basic result via moment 
considerations, Lindeberg’s general version is derived in §2.1. Although Linde- 
berg’s result has become a sine qua non in the writing of probability texts, the 
Berry-Esseen estimate has not. Indeed, until recently, the Berry-Esseen esti- 
mate required a good many somewhat tedious calculations with characteristic 
functions (i.e., Fourier transforms), and most recent authors seem to have de- 
cided that the rewards did not justify the effort. I was inclined to agree with 
them until P. Diaconis brought to my attention E. Bolthausen’s adaptation of 
C. Stein’s techniques (the so-called Stein’s method) to give a proof that is not 
only brief but also, to me, aesthetically pleasing. In any case, no use of Fourier 
methods is made in the derivation given in § 2.2. On the other hand, Fourier 
techniques are introduced in § 2.3, where it is shown that even elementary Fourier 
analytic tools lead to important extensions of the basic Central Limit Theorem 
to more than one dimension. Finally, in § 2.4, the Central Limit Theorem is ap- 
plied to the study of Hermite multipliers and (following Wm. Beckner) is used to 
derive both E. Nelson’s hypercontraction estimate for the Mehler kernel as well 
as Beckner’s own estimate for the Fourier transform. I am afraid that, with this 
flagrant example of the sort of thing that does not belong here, I may be trying
the patience of my purist colleagues. However, I hope that their indignation 
will be somewhat assuaged by the fact that the rest of the book is essentially 
independent of the material in § 2.4. 

3 : This chapter is devoted to the study of infinitely divisible laws. It begins 
in §3.1 with a few refinements (especially The Levy Continuity Theorem) of 
the Fourier techniques introduced in § 2.3. These play a role in § 3.2, where 
the Levy-Khinchine formula is first derived and then applied to the analysis of 
stable laws. 

4: In Chapter 4 I construct the Levy processes (a.k.a. independent increment 
processes) corresponding to infinitely divisible laws. Section 4.1 provides the
requisite information about the pathspace $D(\mathbb{R}^N)$ of right-continuous paths with
left limits, and § 4.2 gives the construction of Levy processes with discontinuous 
paths, the ones corresponding to infinitely divisible laws having no Gaussian 
part. Finally, in § 4.3 I construct Brownian motion, the Levy process with con- 
tinuous paths, following the prescription given by Levy. 

5 : Because they are not needed earlier, conditional expectations do not appear 
until Chapter 5. The advantage gained by this postponement is that, by the 
time I introduce them, I have an ample supply of examples to which condition- 
ing can be applied; the disadvantage is that, with considerable justice, many 
probabilists feel that one is not doing probability theory until one is condition- 
ing. Be that as it may, Kolmogorov’s definition is given in §5.1 and is shown 
to extend naturally both to $\sigma$-finite measure spaces as well as to random
variables with values in a Banach space. Section 5.2 presents Doob’s basic theory
of real-valued, discrete parameter martingales: Doob’s Inequality, his Stopping 
Time Theorem, and his Martingale Convergence Theorem. In the last part of 
§ 5.2, I introduce reversed martingales and apply them to DeFinetti’s theory of 
exchangeable random variables. 

6: Chapter 6 opens with extensions of martingale theory in two directions: to 
$\sigma$-finite measures and to random variables with values in a Banach space. The
results in §6.1 are used in §6.2 to derive Birkhoff’s Individual Ergodic Theorem 
and a couple of its applications. Finally, in § 6.3 I prove Burkholder’s Inequality 
for martingales with values in a Hilbert space. The derivation that I give is 
essentially the same as Burkholder’s second proof, the one that gives optimal 
constants. 

7 : Section 7.1 provides a brief introduction to the theory of martingales with 
a continuous parameter. As anyone at all familiar with the topic knows, any- 
thing approaching a full account of this theory requires much more space than a 
book like this can give it. Thus, I deal with only its most rudimentary aspects, 
which, fortunately, are sufficient for the applications to Brownian motion that I 
have in mind. Namely, in § 7.2 I first discuss the intimate relationship between 
continuous martingales and Brownian motion (Levy’s martingale characteriza- 
tion of Brownian motion), then derive the simplest (and perhaps most widely 
applied) case of the Doob-Meyer Decomposition Theory, and finally show what 
Burkholder’s Inequality looks like for continuous martingales. In the
concluding section, § 7.3, the results in §§ 7.1-7.2 are applied to derive the Reflection
Principle for Brownian motion. 

8: In § 8.1 I formulate the description of Brownian motion in terms of its Gaus- 
sian, as opposed to its independent increment, properties. More precisely, fol- 
lowing Segal and Gross, I attempt to convince the reader that Wiener measure 
(i.e., the distribution of Brownian motion) would like to be the standard Gauss 
measure on the Hilbert space $H^1(\mathbb{R}^N)$ of absolutely continuous paths with a
square integrable derivative, but, for technical reasons, cannot live there and
has to settle for a Banach space in which $H^1(\mathbb{R}^N)$ is densely embedded. Using
Wiener measure as the model, in § 8.2 I show that, at an abstract level, any
non-degenerate, centered Gaussian measure on an infinite dimensional, separa- 
ble Banach space shares the same structure as Wiener measure in the sense 
that there is always a densely embedded Hilbert space, known as the
Cameron-Martin space, for which it would like to be the standard Gaussian measure but
on which it does not fit. In order to carry out this program, I need and prove 
Fernique’s Theorem for Gaussian measures on a Banach space. In § 8.3 I begin 
by going in the opposite direction, showing how to pass from a Hilbert space H 
to a Gaussian measure on a Banach space E for which H is the Cameron-Martin 
space. The rest of § 8.3 gives two applications: one to “pinned Brownian motion”
and the second to a very general statement of orthogonal invariance for Gaussian
measures. The main goal of § 8.4 is to prove a large deviations result, known as 
Schilder’s Theorem, for abstract Wiener spaces; and once I have Schilder’s The- 
orem, I apply it to derive a version of Strassen’s Law of the Iterated Logarithm. 
Starting with the Ornstein-Uhlenbeck process, I construct in § 8.5 a family of 
Gaussian measures known in the mathematical physics literature as Euclidean 
free fields. In the final section, § 8.6, I first show how to construct Banach space- 
valued Brownian motion and then derive the original form of Strassen’s Law of 
the Iterated Logarithm in that context. 

9 : The central topic here is the abstract theory of weak convergence of prob- 
ability measures on a Polish space. The basic theory is developed in §9.1. In 
§ 9.2 I apply the theory to prove the existence of regular conditional probability 
distributions, and in §9.3 I use it to derive Donsker’s Invariance Principle (i.e., 
the pathspace statement of the Central Limit Theorem). 

10 : Chapter 10 is an introduction to the connections between probability the- 
ory and partial differential equations. At the beginning of §10.1 I show that 
martingale theory provides a link between probability theory and partial dif- 
ferential equations. More precisely, I show how to represent in terms of Wiener 
integrals solutions to parabolic and elliptic partial differential equations in which 
the Laplacian is the principal part. In the second part of § 10.1, I use this link to
calculate various Wiener integrals. In § 10.2 I introduce the Markov property of 
Wiener measure and show how it not only allows one to evaluate other Wiener 
integrals in terms of solutions to elliptic partial differential equations but also 
enables one to prove interesting facts about solutions to such equations as a con- 
sequence of their representation in terms of Wiener integrals. Continuing in the 
same spirit, I show in § 10.2 how to represent solutions to the Dirichlet problem 
in terms of Wiener integrals, and in § 10.3 I use Wiener measure to construct 
and discuss heat kernels related to the Laplacian.

11 : The final chapter is an extended example of the way in which probability 
theory meshes with other branches of analysis, and the example that I have cho- 
sen is the marriage between Brownian motion and classical potential theory. Like 
an ideal marriage, this one is simultaneously intimate and mutually beneficial to 
both partners. Indeed, the more one knows about it, the more convinced one be- 
comes that the properties of Brownian paths are a perfect reflection of properties 
of harmonic functions, and vice versa. In any case, in § 11.1 I sharpen the results
in § 10.2.3 and show that, in complete generality, the solution to the Dirichlet 
problem is given by the Wiener integral of the boundary data evaluated at the 
place where Brownian paths exit from the region. Next, in § 11.2, I discuss the 
Green function for a region and explain how its existence reflects the recurrence 
and transience properties of Brownian paths. In preparation for § 11.4, § 11.3 is 
devoted to the Riesz Decomposition Theorem for excessive functions. Finally, 
in § 11.4, I discuss the capacity of regions, derive Chung’s representation of the 
capacitory measure in terms of the last place where a Brownian path visits a
region, apply the probabilistic interpretation of capacity to give a derivation of 
Wiener’s test for regularity, and conclude with two asymptotic calculations in 
which capacity plays a crucial role. 



Suggestions about the Use of This Book 

In spite of the realistic assessment contained in the first paragraph of its preface, 
when I wrote the first edition of this book I harbored the naive hope that it 
might become the standard graduate text in probability theory. By the time 
that I started preparing the second edition, I was significantly older and far less 
naive about its prospects. Although the first edition has its admirers, it has 
done little to dent the sales record of its competitors. In particular, the first 
edition has seldom been adopted as the text for courses in probability, and I 
doubt that the second will be either. Nonetheless, I close this preface with a few 
suggestions for anyone who does choose to base a course on it. 

I am well aware that, except for those who find their way into the poorly 
stocked library of some prison camp, few copies of this book will be read from 
cover to cover. For this reason, I have attempted to organize it in such a way that, 
with the help of the table of dependence that follows, a reader can select a path 
which does not require his reading all the sections preceding the information he 
is seeking. For example, the contents of §§ 1.1-1.2, § 1.4, § 2.1, § 2.3, and
§§ 5.1-5.2 constitute the backbone of a one semester, graduate level introduction to
probability theory. What one attaches to this backbone depends on the speed 
with which these sections are covered and the content of the courses for which 
the course is the introduction. If the goal is to prepare the students for a career 
as a “quant” in what is left of the financial industry, an obvious choice is § 4.3 
and as much of Chapter 7 as time permits, thereby giving one’s students a 
reasonably solid introduction to Brownian motion. On the other hand, if one 
wants the students to appreciate that white noise is not the only noise that they 
may encounter in life, one might defer the discussion of Brownian motion and 
replace it with the material in Chapter 3 and §§ 4.1-4.2.

Alternatively, one might use this book in a more advanced course. An intro- 
duction to stochastic processes with an emphasis on their relationship to partial 
differential equations can be constructed out of Chapters 6, 7, 10, and 11, and 
§ 4.3 combined with Chapter 8 could be used to provide background for a course 
on Gaussian processes. 

Whatever route one takes through this book, it will be a great help to your 
students for you to suggest that they consult other texts. Indeed, it is a familiar 
fact that the third book one reads on a subject is always the most lucid, and so 
one should suggest at least two other books. Among the many excellent choices 
available, I mention Wm. Feller’s An Introduction to Probability Theory and Its 
Applications, Vol. II, and M. Loeve’s classic Probability Theory. In addition, for 
background, precision (including accuracy of attribution), and supplementary
material, R. Dudley’s Real Analysis and Probability is superb. 
Chapter 1

Sums of Independent Random Variables 



In one way or another, most probabilistic analysis entails the study of large 
families of random variables. The key to such analysis is an understanding 
of the relations among the family members; and of all the possible ways in 
which members of a family can be related, by far the simplest is when there 
is no relationship at all! For this reason, I will begin by looking at families of 
independent random variables. 

§ 1.1 Independence 

In this section I will introduce Kolmogorov’s way of describing independence 
and prove a few of its consequences. 

§ 1.1.1. Independent $\sigma$-Algebras. Let $(\Omega, \mathcal{F}, P)$ be a probability space
(i.e., $\Omega$ is a nonempty set, $\mathcal{F}$ is a $\sigma$-algebra over $\Omega$, and $P$ is a non-negative
measure on the measurable space $(\Omega, \mathcal{F})$ having total mass 1), and, for each $i$
from the (non-empty) index set $\mathcal{I}$, let $\mathcal{F}_i$ be a sub-$\sigma$-algebra of $\mathcal{F}$. I will say
that the $\sigma$-algebras $\mathcal{F}_i$, $i \in \mathcal{I}$, are mutually $P$-independent, or, less precisely,
$P$-independent, if, for every finite subset $\{i_1, \ldots, i_n\}$ of distinct elements of $\mathcal{I}$
and every choice of $A_{i_m} \in \mathcal{F}_{i_m}$, $1 \le m \le n$,

$$P(A_{i_1} \cap \cdots \cap A_{i_n}) = P(A_{i_1}) \cdots P(A_{i_n}). \tag{1.1.1}$$

In particular, if $\{A_i : i \in \mathcal{I}\}$ is a family of sets from $\mathcal{F}$, I will say that $A_i$, $i \in \mathcal{I}$,
are $P$-independent if the associated $\sigma$-algebras $\mathcal{F}_i = \{\emptyset, A_i, A_i^{\complement}, \Omega\}$, $i \in \mathcal{I}$,
are. To gain an appreciation for the intuition on which this definition is based,
it is important to notice that independence of the pair $A_1$ and $A_2$ in the present
sense is equivalent to $P(A_1 \cap A_2) = P(A_1)P(A_2)$, the classical definition that
one encounters in elementary treatments. Thus, the notion of independence just 
introduced is no more than a simple generalization of the classical notion of 
independent pairs of sets encountered in non-measure theoretic presentations, 
and therefore the intuition that underlies the elementary notion applies equally 
well to the definition given here. (See Exercise 1.1.8 for more information about 
the connection between the present definition and the classical one.) 
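
As a concrete check of the remark above, here is a minimal sketch on a finite probability space. The uniform measure on twelve points and the events `A`, `B`, and `C` are illustrative assumptions, not taken from the text. The sketch tests whether the single product identity for a pair of sets already gives the product rule for every choice of sets from the two four-element $\sigma$-algebras that the pair generates.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite sample space with the uniform probability measure.
OMEGA = frozenset(range(12))

def prob(event):
    """Exact probability of an event under the uniform measure on OMEGA."""
    return Fraction(len(event), len(OMEGA))

def generated_sigma_algebra(event):
    """The sigma-algebra {empty set, A, complement of A, OMEGA} generated by one set."""
    return [frozenset(), event, OMEGA - event, OMEGA]

def sigma_algebras_independent(event_a, event_b):
    """Check P(S & T) == P(S) P(T) for every S, T in the two generated sigma-algebras."""
    return all(
        prob(s & t) == prob(s) * prob(t)
        for s, t in product(generated_sigma_algebra(event_a),
                            generated_sigma_algebra(event_b))
    )

# Illustrative events: "even" and "less than 6".
A = frozenset(k for k in OMEGA if k % 2 == 0)
B = frozenset(k for k in OMEGA if k < 6)
# A dependent pair for contrast: "less than 6" and "less than 3".
C = frozenset(k for k in OMEGA if k < 3)

# The classical, pairwise product identity for each pair.
classical_ab = prob(A & B) == prob(A) * prob(B)
classical_bc = prob(B & C) == prob(B) * prob(C)
```

Exact rational arithmetic is used so that the equalities are tested without rounding error.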

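As will become increasingly evident as we proceed, infinite families of independent
objects possess surprising and beautiful properties. In particular, mutually
independent $\sigma$-algebras tend to fill up space in a sense made precise by the
following beautiful thought experiment designed by A.N. Kolmogorov. Let $\mathcal{I}$ be
any index set, take $\mathcal{F}_{\emptyset} = \{\emptyset, \Omega\}$, and, for each non-empty subset $\Lambda \subseteq \mathcal{I}$, let

$$\mathcal{F}_{\Lambda} = \bigvee_{i \in \Lambda} \mathcal{F}_i \equiv \sigma\Bigl(\bigcup_{i \in \Lambda} \mathcal{F}_i\Bigr)$$

be the $\sigma$-algebra generated by $\bigcup_{i \in \Lambda} \mathcal{F}_i$ (i.e., $\mathcal{F}_{\Lambda}$ is the smallest $\sigma$-algebra
containing $\bigcup_{i \in \Lambda} \mathcal{F}_i$). Next, define the tail $\sigma$-algebra $\mathcal{T}$ to be the intersection over
all finite $\Lambda \subseteq \mathcal{I}$ of the $\sigma$-algebras $\mathcal{F}_{\Lambda^{\complement}}$. When $\mathcal{I}$ itself is finite, $\mathcal{T} = \{\emptyset, \Omega\}$ and
is therefore $P$-trivial in the sense that $P(A) \in \{0, 1\}$ for every $A \in \mathcal{T}$. The
interesting remark made by Kolmogorov is that even when $\mathcal{I}$ is infinite, $\mathcal{T}$ is
$P$-trivial whenever the original $\mathcal{F}_i$’s are $P$-independent. To see this, for a given
non-empty $\Lambda \subseteq \mathcal{I}$, let $\mathcal{C}_{\Lambda}$ denote the collection of sets of the form $A_{i_1} \cap \cdots \cap A_{i_n}$
where $\{i_1, \ldots, i_n\}$ are distinct elements of $\Lambda$ and $A_{i_m} \in \mathcal{F}_{i_m}$ for each $1 \le m \le n$.
Clearly $\mathcal{C}_{\Lambda}$ is closed under intersection and $\mathcal{F}_{\Lambda} = \sigma(\mathcal{C}_{\Lambda})$. In addition, by
assumption, $P(A \cap B) = P(A)P(B)$ for all $A \in \mathcal{C}_{\Lambda}$ and $B \in \mathcal{C}_{\Lambda^{\complement}}$. Hence, by Exercise
1.1.12, $\mathcal{F}_{\Lambda}$ is independent of $\mathcal{F}_{\Lambda^{\complement}}$. But this means that $\mathcal{T}$ is independent of $\mathcal{F}_F$
for every finite $F \subseteq \mathcal{I}$, and therefore, again by Exercise 1.1.12, $\mathcal{T}$ is independent
of

$$\mathcal{F}_{\mathcal{I}} = \sigma\Bigl(\bigcup \bigl\{\mathcal{F}_F : F \text{ a finite subset of } \mathcal{I}\bigr\}\Bigr).$$

Since $\mathcal{T} \subseteq \mathcal{F}_{\mathcal{I}}$, this implies that $\mathcal{T}$ is independent of itself; that is, $P(A \cap B) =
P(A)P(B)$ for all $A, B \in \mathcal{T}$. Hence, for every $A \in \mathcal{T}$, $P(A) = P(A)^2$, or,
equivalently, $P(A) \in \{0, 1\}$, and so I have now proved the following famous
result.

Theorem 1.1.2 (Kolmogorov’s 0-1 Law). Let $\{\mathcal{F}_i : i \in \mathcal{I}\}$ be a family
of $P$-independent sub-$\sigma$-algebras of $(\Omega, \mathcal{F}, P)$, and define the tail $\sigma$-algebra $\mathcal{T}$
accordingly, as above. Then, for every $A \in \mathcal{T}$, $P(A)$ is either 0 or 1.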

To develop a feeling for the kind of conclusions that can be drawn from
Kolmogorov’s 0-1 Law (cf. Exercises 1.1.18 and 1.1.19 as well), let $\{A_n : n \ge 1\}$ be
a sequence of subsets of $\Omega$, and recall the notation

$$\varlimsup_{n\to\infty} A_n = \bigcap_{m=1}^{\infty} \bigcup_{n \ge m} A_n = \{\omega : \omega \in A_n \text{ for infinitely many } n \in \mathbb{Z}^+\}.$$

Obviously, $\varlimsup_{n\to\infty} A_n$ is measurable with respect to the tail field determined by
the sequence of $\sigma$-algebras $\{\emptyset, A_n, A_n^{\complement}, \Omega\}$, $n \in \mathbb{Z}^+$; and therefore, if the $A_n$’s
are $P$-independent elements of $\mathcal{F}$, then

$$P\Bigl(\varlimsup_{n\to\infty} A_n\Bigr) \in \{0, 1\}.$$

In words, this conclusion can be summarized as follows: for any sequence of
$P$-independent events $A_n$, $n \in \mathbb{Z}^+$, either $P$-almost every $\omega \in \Omega$ is in infinitely
many $A_n$’s or $P$-almost every $\omega \in \Omega$ is in at most finitely many $A_n$’s. A more
quantitative statement of this same fact is contained in the second part of the
following useful result.

Lemma 1.1.3 (Borel-Cantelli Lemma). Let $\{A_n : n \in \mathbb{Z}^+\} \subseteq \mathcal{F}$ be given.
Then

$$\sum_{n=1}^{\infty} P(A_n) < \infty \implies P\Bigl(\varlimsup_{n\to\infty} A_n\Bigr) = 0. \tag{1.1.4}$$

In fact, if the $A_n$’s are $P$-independent sets, then

$$\sum_{n=1}^{\infty} P(A_n) = \infty \iff P\Bigl(\varlimsup_{n\to\infty} A_n\Bigr) = 1. \tag{1.1.5}$$

(See part (iii) of Exercise 5.2.40 and Lemma 11.4.14 for generalizations.) 

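PROOF: The first assertion, which is due to E. Borel, is an easy application of
countable additivity. Namely, by countable additivity,

$$P\Bigl(\varlimsup_{n\to\infty} A_n\Bigr) = \lim_{m\to\infty} P\Bigl(\bigcup_{n \ge m} A_n\Bigr) \le \lim_{m\to\infty} \sum_{n \ge m} P(A_n) = 0$$

if $\sum_{n=1}^{\infty} P(A_n) < \infty$.

To complete the proof of (1.1.5) when the $A_n$’s are independent, note that,
by countable additivity, $P\bigl(\varlimsup_{n\to\infty} A_n\bigr) = 1$ if and only if

$$\lim_{m\to\infty} P\Bigl(\bigcap_{n \ge m} A_n^{\complement}\Bigr) = P\Bigl(\bigcup_{m=1}^{\infty} \bigcap_{n \ge m} A_n^{\complement}\Bigr) = P\Bigl(\bigl(\varlimsup_{n\to\infty} A_n\bigr)^{\complement}\Bigr) = 0.$$

But, by independence and another application of countable additivity, for any
given $m \ge 1$ we have that

$$P\Bigl(\bigcap_{n=m}^{\infty} A_n^{\complement}\Bigr) = \lim_{N\to\infty} \prod_{n=m}^{N} \bigl(1 - P(A_n)\bigr) \le \lim_{N\to\infty} \exp\Bigl[-\sum_{n=m}^{N} P(A_n)\Bigr] = 0$$

if $\sum_{n=1}^{\infty} P(A_n) = \infty$. (In the preceding, I have used the trivial inequality $1 - t \le
e^{-t}$, $t \in [0, \infty)$.) □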
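
The estimate that drives the second half of the proof can be checked numerically. The following sketch compares $\prod_{n=m}^{N}(1 - P(A_n))$ with $\exp\bigl[-\sum_{n=m}^{N} P(A_n)\bigr]$. It assumes the illustrative choice $P(A_n) = 1/n$ for $n \ge 2$, a divergent series that is not taken from the text, and the function names are hypothetical. For this choice the product telescopes to $(m-1)/N$, so both sides tend to 0 as $N \to \infty$.

```python
import math
from fractions import Fraction

def prob_no_event_occurs(p, m, N):
    """prod_{n=m}^{N} (1 - p(n)): for independent events with P(A_n) = p(n),
    the probability that none of A_m, ..., A_N occurs."""
    result = Fraction(1)
    for n in range(m, N + 1):
        result *= 1 - p(n)
    return result

def exponential_bound(p, m, N):
    """exp(-sum_{n=m}^{N} p(n)), the upper bound obtained from 1 - t <= exp(-t)."""
    return math.exp(-sum(float(p(n)) for n in range(m, N + 1)))

def p_harmonic(n):
    """Illustrative (assumed) choice P(A_n) = 1/n, whose sum diverges."""
    return Fraction(1, n)

m = 2
for N in (10, 100, 1000):
    exact = prob_no_event_occurs(p_harmonic, m, N)
    bound = exponential_bound(p_harmonic, m, N)
    print(N, exact, bound)
```

Replacing `p_harmonic` by a summable choice such as $1/n^2$ shows the contrast: the product then stays bounded away from 0.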

A second, and perhaps more transparent, way of dealing with the contents of
the preceding is to introduce the non-negative random variable $N(\omega) \in \mathbb{Z}^+ \cup
\{\infty\}$ that counts the number of $n \in \mathbb{Z}^+$ such that $\omega \in A_n$. Then, by Tonelli’s
Theorem,¹ $\mathbb{E}^{P}[N] = \sum_{n=1}^{\infty} P(A_n)$, and so Borel’s contribution is equivalent to
the implication $\mathbb{E}^{P}[N] < \infty \implies P(N < \infty) = 1$, which is obvious, whereas Cantelli’s
contribution is that, for mutually independent $A_n$’s, $P(N < \infty) > 0 \implies \mathbb{E}^{P}[N] <
\infty$, which is not obvious.
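
The identity between the expected count and the sum of the probabilities is an exchange in the order of summation, which can be seen exactly in a finite setting. The sketch below is only an illustration under assumed data: the uniform measure on $\{0, \ldots, 59\}$ and the events "multiples of $n$" for $1 \le n \le 10$ are hypothetical choices, and no independence is required.

```python
from fractions import Fraction

# Hypothetical finite setting: uniform measure on {0, ..., 59} and the events
# A_n = {multiples of n} for n = 1, ..., 10 (independence is not needed here).
OMEGA = range(60)
EVENTS = {n: {w for w in OMEGA if w % n == 0} for n in range(1, 11)}

def prob(event):
    """Exact probability of an event under the uniform measure on OMEGA."""
    return Fraction(len(event), len(OMEGA))

def count(w):
    """N(w): the number of indices n with w in A_n."""
    return sum(1 for event in EVENTS.values() if w in event)

# Integrate N over OMEGA (sum over sample points first) ...
expected_count = sum(Fraction(count(w), len(OMEGA)) for w in OMEGA)
# ... versus summing the probabilities P(A_n) (sum over indices first).
sum_of_probs = sum(prob(event) for event in EVENTS.values())
```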

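§ 1.1.2. Independent Functions. Having described what it means for the
$\sigma$-algebras to be $P$-independent, I will now transfer the notion to random variables
on $(\Omega, \mathcal{F}, P)$. Namely, for each $i \in \mathcal{I}$, let $X_i$ be a random variable (i.e., a
measurable function on $(\Omega, \mathcal{F})$ with values in the measurable space $(E_i, \mathcal{B}_i)$). I
will say that the random variables $X_i$, $i \in \mathcal{I}$, are (mutually) $P$-independent
if the $\sigma$-algebras

$$\sigma(X_i) = X_i^{-1}(\mathcal{B}_i) \equiv \bigl\{X_i^{-1}(B_i) : B_i \in \mathcal{B}_i\bigr\}, \quad i \in \mathcal{I},$$

are $P$-independent. If $B(E; \mathbb{R}) = B\bigl((E, \mathcal{B}); \mathbb{R}\bigr)$ denotes the space of bounded
measurable $\mathbb{R}$-valued functions on the measurable space $(E, \mathcal{B})$, then it should
be clear that $P$-independence of $\{X_i : i \in \mathcal{I}\}$ is equivalent to the statement that

$$\mathbb{E}^{P}\bigl[f_{i_1} \circ X_{i_1} \cdots f_{i_n} \circ X_{i_n}\bigr] = \mathbb{E}^{P}\bigl[f_{i_1} \circ X_{i_1}\bigr] \cdots \mathbb{E}^{P}\bigl[f_{i_n} \circ X_{i_n}\bigr]$$

for all finite subsets $\{i_1, \ldots, i_n\}$ of distinct elements of $\mathcal{I}$ and all choices of
$f_{i_1} \in B(E_{i_1}; \mathbb{R}), \ldots, f_{i_n} \in B(E_{i_n}; \mathbb{R})$. Finally, if $\mathbf{1}_A$ given by

$$\mathbf{1}_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A \end{cases}$$

denotes the indicator function of the set $A \subseteq \Omega$, notice that the family of sets
$\{A_i : i \in \mathcal{I}\} \subseteq \mathcal{F}$ is $P$-independent if and only if the random variables $\mathbf{1}_{A_i}$, $i \in \mathcal{I}$,
are $P$-independent.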

Thus far I have discussed only the abstract notion of independence and have 
yet to show that the concept is not vacuous. In the modern literature, the 
standard way to construct lots of independent quantities is to take products of 
probability spaces. Namely, if $(E_i, \mathcal{B}_i, \mu_i)$ is a probability space for each $i \in \mathcal{I}$,
one sets $\Omega = \prod_{i \in \mathcal{I}} E_i$; defines $\pi_i : \Omega \to E_i$ to be the natural projection map
for each $i \in \mathcal{I}$; takes $\mathcal{F}_i = \pi_i^{-1}(\mathcal{B}_i)$, $i \in \mathcal{I}$, and $\mathcal{F} = \bigvee_{i \in \mathcal{I}} \mathcal{F}_i$; and shows that
there is a unique probability measure $P$ on $(\Omega, \mathcal{F})$ with the properties that

$$P\bigl(\pi_i^{-1}\Gamma_i\bigr) = \mu_i(\Gamma_i) \quad \text{for all } i \in \mathcal{I} \text{ and } \Gamma_i \in \mathcal{B}_i$$

and the $\sigma$-algebras $\mathcal{F}_i$, $i \in \mathcal{I}$, are $P$-independent. Although this procedure is
extremely powerful, it is rather mechanical. For this reason, I have chosen to
defer the details of the product construction to Exercises 1.1.14 and 1.1.16 and
to, instead, spend the rest of this section developing a more hands-on approach
to constructing independent sequences of real-valued random variables. Indeed,
although the product method is more ubiquitous and has become the
construction of choice, the one that I am about to present has the advantage that it shows
independent random variables can arise “naturally” and even in familiar places.

¹ Throughout this book, I use $\mathbb{E}^{P}[X, A]$ to denote the expected value under $P$ of $X$ over the set
$A$. That is, $\mathbb{E}^{P}[X, A] = \int_A X \, dP$. Finally, when $A = \Omega$, I will write $\mathbb{E}^{P}[X]$. Tonelli’s Theorem
is the version of Fubini’s Theorem for non-negative functions. Its virtue is that it applies
whether or not the integrand is integrable.
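
For two finite factors the product construction can be carried out by hand, and the two defining properties can then be verified exhaustively. The following is a minimal sketch under stated assumptions: the point masses `mu_1` and `mu_2` are hypothetical, and in the finite case the product measure is simply defined on points, which sidesteps the existence and uniqueness argument needed in general.

```python
from fractions import Fraction
from itertools import product

# Two hypothetical finite probability spaces (E_i, mu_i), given by point masses.
mu_1 = {"a": Fraction(1, 3), "b": Fraction(2, 3)}
mu_2 = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}

# Product space OMEGA = E_1 x E_2 with the product measure defined on points.
P = {(x, y): mu_1[x] * mu_2[y] for x, y in product(mu_1, mu_2)}

def prob(event):
    """P of a set of points of OMEGA."""
    return sum(P[w] for w in event)

def preimage(coordinate, gamma):
    """pi_i^{-1}(gamma): points of OMEGA whose given coordinate lies in gamma."""
    return {w for w in P if w[coordinate] in gamma}

def subsets(points):
    """All subsets of a finite set of points."""
    points = list(points)
    return [{pt for k, pt in enumerate(points) if (mask >> k) & 1}
            for mask in range(1 << len(points))]

def product_properties_hold():
    """Check the marginal identity and the independence of the coordinate sigma-algebras."""
    for gamma_1, gamma_2 in product(subsets(mu_1), subsets(mu_2)):
        s, t = preimage(0, gamma_1), preimage(1, gamma_2)
        if prob(s) != sum(mu_1[x] for x in gamma_1):
            return False
        if prob(t) != sum(mu_2[y] for y in gamma_2):
            return False
        if prob(s & t) != prob(s) * prob(t):
            return False
    return True
```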

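§1.1.3. The Rademacher Functions. Until further notice, take $(\Omega, \mathcal{F}) = \bigl([0,1), \mathcal{B}_{[0,1)}\bigr)$ (when $E$ is a metric space, I use $\mathcal{B}_E$ to denote the Borel field over $E$) and $\mathbb{P}$ to be the restriction $\lambda_{[0,1)}$ of Lebesgue measure $\lambda_{\mathbb{R}}$ to $[0,1)$. Next define the Rademacher functions $R_n$, $n \in \mathbb{Z}^+$, on $\Omega$ as follows. Take the integer part $\lfloor t \rfloor$ of $t \in \mathbb{R}$ to be the largest integer dominated by $t$, and consider the function $R : \mathbb{R} \to \{-1, 1\}$ given by

$$R(t) = \begin{cases} -1 & \text{if } t - \lfloor t \rfloor \in \bigl[0, \tfrac12\bigr) \\ \phantom{-}1 & \text{if } t - \lfloor t \rfloor \in \bigl[\tfrac12, 1\bigr). \end{cases}$$

The function $R_n$ is then defined on $[0,1)$ by

$$R_n(\omega) = R\bigl(2^{n-1}\omega\bigr), \quad n \in \mathbb{Z}^+ \text{ and } \omega \in [0,1).$$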



I will now show that the Rademacher functions are $\mathbb{P}$-independent. To this end, first note that every real-valued function $f$ on $\{-1, 1\}$ is of the form $\alpha + \beta x$, $x \in \{-1, 1\}$, for some pair of real numbers $\alpha$ and $\beta$. Thus, all that I have to show is that

$$\mathbb{E}^{\mathbb{P}}\bigl[(\alpha_1 + \beta_1 R_1) \cdots (\alpha_n + \beta_n R_n)\bigr] = \alpha_1 \cdots \alpha_n$$

for any $n \in \mathbb{Z}^+$ and $(\alpha_1, \beta_1), \dots, (\alpha_n, \beta_n) \in \mathbb{R}^2$. Since this is obvious when $n = 1$, I will assume that it holds for $n$ and need only check that it must also hold for $n + 1$, and clearly this comes down to checking that

$$\mathbb{E}^{\mathbb{P}}\bigl[F(R_1, \dots, R_n)\, R_{n+1}\bigr] = 0$$

for any $F : \{-1, 1\}^n \to \mathbb{R}$. But $(R_1, \dots, R_n)$ is constant on each interval

$$I_{m,n} \equiv \Bigl[\frac{m}{2^n}, \frac{m+1}{2^n}\Bigr), \quad 0 \le m < 2^n,$$

whereas $R_{n+1}$ integrates to 0 on each $I_{m,n}$. Hence, by writing the integral over $\Omega$ as the sum of integrals over the $I_{m,n}$’s, we get the desired result.
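The independence argument can be checked by hand on small cases. The following is a minimal Python sketch, not part of the original text: the helper names `rademacher` and `mean_of_product` are illustrative assumptions. It uses the observation made in the proof that $R_1, \dots, R_N$ are constant on each dyadic interval of length $2^{-N}$, so averaging over left endpoints gives exact integrals. Expanding the product $(\alpha_1 + \beta_1 R_1) \cdots (\alpha_n + \beta_n R_n)$ shows that the displayed identity is equivalent to every product of distinct Rademacher functions having mean 0.

```python
from fractions import Fraction
from itertools import combinations
from math import floor


def rademacher(n, omega):
    """R_n(omega) = R(2^(n-1) * omega), where R is -1 on [0, 1/2) and +1 on [1/2, 1) modulo 1."""
    t = 2 ** (n - 1) * omega
    frac = t - floor(t)
    return -1 if frac < Fraction(1, 2) else 1


def mean_of_product(indices):
    """Exact mean of the product of R_n, n in indices, under Lebesgue measure on [0, 1)."""
    N = max(indices)
    total = 0
    # Every R_n with n <= N is constant on [m / 2^N, (m + 1) / 2^N),
    # so the integral is the average of the values at the left endpoints.
    for m in range(2 ** N):
        omega = Fraction(m, 2 ** N)
        prod = 1
        for n in indices:
            prod *= rademacher(n, omega)
        total += prod
    return Fraction(total, 2 ** N)


# All products of distinct Rademacher functions among R_1, ..., R_4 have mean 0.
for r in range(1, 5):
    for subset in combinations(range(1, 5), r):
        print(subset, mean_of_product(subset))
```

Exact rational arithmetic (`Fraction`) is used so that the dyadic endpoints are represented without rounding.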

At this point I have produced a countably infinite sequence of independent Bernoulli random variables (i.e., two-valued random variables whose range is usually either $\{-1, 1\}$ or $\{0, 1\}$) with mean value 0. In order to get more general random variables, I will combine our Bernoulli random variables together in a clever way.

Recall that a random variable $U$ is said to be uniformly distributed on the finite interval $[a, b]$ if

$$\mathbb{P}(U \le t) = \frac{t - a}{b - a} \quad \text{for } t \in [a, b].$$

Lemma 1.1.6. Let $\{X_\ell : \ell \in \mathbb{Z}^+\}$ be a sequence of $\mathbb{P}$-independent $\{0, 1\}$-valued Bernoulli random variables with mean value $\tfrac12$ on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$, and set

$$U = \sum_{\ell=1}^{\infty} \frac{X_\ell}{2^\ell}.$$

Then $U$ is uniformly distributed on $[0, 1]$.

PROOF: Because the assertion only involves properties of distributions, it will be proved in general as soon as I prove it for a particular realization of independent, mean value $\tfrac12$, $\{0, 1\}$-valued Bernoulli random variables. In particular, by the preceding discussion, I need only consider the random variables

$$\epsilon_n(\omega) \equiv \frac{1 + R_n(\omega)}{2}, \quad n \in \mathbb{Z}^+ \text{ and } \omega \in [0, 1),$$

on $\bigl([0,1), \mathcal{B}_{[0,1)}, \lambda_{[0,1)}\bigr)$. But, as is easily checked (cf. part (i) of Exercise 1.1.11), for each $\omega \in [0, 1)$, $\omega = \sum_{n=1}^{\infty} 2^{-n} \epsilon_n(\omega)$. Hence, the desired conclusion is trivial in this case. □
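The identity $\omega = \sum_{n \ge 1} 2^{-n}\epsilon_n(\omega)$ simply says that the $\epsilon_n(\omega)$ are the binary digits of $\omega$. As a small illustration, here is a hedged Python sketch; the function names `eps` and `partial_sum` are hypothetical, and the digit formula $\epsilon_n(\omega) = \lfloor 2^n\omega \rfloor - 2\lfloor 2^{n-1}\omega \rfloor$ is the one stated in part (i) of Exercise 1.1.11.

```python
from fractions import Fraction
from math import floor


def eps(n, omega):
    """n-th binary digit of omega in [0, 1): floor(2^n * omega) - 2 * floor(2^(n-1) * omega)."""
    return floor(2 ** n * omega) - 2 * floor(2 ** (n - 1) * omega)


def partial_sum(omega, N):
    """Sum of eps_n(omega) / 2^n over 1 <= n <= N."""
    return sum(Fraction(eps(n, omega), 2 ** n) for n in range(1, N + 1))


omega = Fraction(5, 7)
for N in (1, 5, 10, 20):
    # The remainder omega - partial_sum always lies in [0, 2^-N).
    print(N, float(omega - partial_sum(omega, N)))
```

The remainder after $N$ digits is non-negative and smaller than $2^{-N}$, which is why the series converges to $\omega$.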

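Now let $(k, \ell) \in \mathbb{Z}^+ \times \mathbb{Z}^+ \longmapsto n(k, \ell) \in \mathbb{Z}^+$ be any one-to-one mapping of $\mathbb{Z}^+ \times \mathbb{Z}^+$ onto $\mathbb{Z}^+$, and set

$$Y_{k,\ell} = \frac{1 + R_{n(k,\ell)}}{2}, \quad (k, \ell) \in (\mathbb{Z}^+)^2.$$

Clearly, each $Y_{k,\ell}$ is a $\{0, 1\}$-valued Bernoulli random variable with mean value $\tfrac12$, and the family $\{Y_{k,\ell} : (k, \ell) \in (\mathbb{Z}^+)^2\}$ is $\mathbb{P}$-independent. Hence, by Lemma 1.1.6, each of the random variables

$$U_k \equiv \sum_{\ell=1}^{\infty} \frac{Y_{k,\ell}}{2^\ell}, \quad k \in \mathbb{Z}^+,$$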

is uniformly distributed on $[0, 1)$. In addition, the $U_k$’s are obviously mutually independent. Hence, I have now produced a sequence of mutually independent random variables, each of which is uniformly distributed on $[0, 1)$. To complete our program, I use the time-honored transformation that takes a uniform random variable into an arbitrary one. Namely, given a distribution function $F$ on $\mathbb{R}$ (i.e., $F$ is a right-continuous, non-decreasing function that tends to 0 at $-\infty$ and 1 at $+\infty$), define $F^{-1}$ on $[0, 1]$ to be the left-continuous inverse of $F$. That is,

$$F^{-1}(t) = \inf\{s \in \mathbb{R} : F(s) \ge t\}, \quad t \in [0, 1].$$

(Throughout, the infimum over the empty set is taken to be $+\infty$.) It is then an easy matter to check that when $U$ is uniformly distributed on $[0, 1)$ the random variable $X = F^{-1} \circ U$ has distribution function $F$:

$$\mathbb{P}(X \le t) = F(t), \quad t \in \mathbb{R}.$$

Hence, after combining this with what we already know, I have now completed 
the proof of the following theorem. 

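Theorem 1.1.7. Let $\Omega = [0, 1)$, $\mathcal{F} = \mathcal{B}_{[0,1)}$, and $\mathbb{P} = \lambda_{[0,1)}$. Then, for any sequence $\{F_k : k \in \mathbb{Z}^+\}$ of distribution functions on $\mathbb{R}$, there exists a sequence $\{X_k : k \in \mathbb{Z}^+\}$ of $\mathbb{P}$-independent random variables on $(\Omega, \mathcal{F}, \mathbb{P})$ with the property that $\mathbb{P}(X_k \le t) = F_k(t)$, $t \in \mathbb{R}$, for each $k \in \mathbb{Z}^+$.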
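To make the transformation $X = F^{-1} \circ U$ concrete, here is a minimal Python sketch for the special case of a distribution with finitely many atoms. The function name `quantile` and the example distribution are assumptions introduced for illustration only; the sketch just evaluates $F^{-1}(t) = \inf\{s : F(s) \ge t\}$ for a step function $F$.

```python
from bisect import bisect_left
from fractions import Fraction
from itertools import accumulate


def quantile(atoms, probs, t):
    """F^{-1}(t) = inf{s : F(s) >= t} for the law putting mass probs[i] on atoms[i].

    atoms must be sorted increasingly and probs must sum to 1.
    """
    if t <= 0:
        return float("-inf")  # F(s) >= 0 for every s, so the infimum is -infinity
    cdf = list(accumulate(probs))  # value of F at each atom
    i = bisect_left(cdf, t)  # first index with cdf[i] >= t
    return atoms[i] if i < len(atoms) else float("inf")


atoms = [0, 1, 5]
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]

# The set of t in (0, 1] mapped to a given atom has length equal to its mass,
# which is why F^{-1} composed with a uniform variable has distribution function F.
counts = {a: 0 for a in atoms}
for k in range(1, 61):
    counts[quantile(atoms, probs, Fraction(k, 60))] += 1
print(counts)
```

Counting how many of the 60 evenly spaced values of $t$ land on each atom recovers the masses $\tfrac12$, $\tfrac13$, $\tfrac16$ as the proportions 30, 20, and 10 out of 60.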

Exercises for §1.1 

Exercise 1.1.8. As I pointed out, $\mathbb{P}(A_1 \cap A_2) = \mathbb{P}(A_1)\mathbb{P}(A_2)$ if and only if the $\sigma$-algebra generated by $A_1$ is $\mathbb{P}$-independent of the one generated by $A_2$. Construct an example to show that the analogous statement is false when dealing with three, instead of two, sets. That is, just because $\mathbb{P}(A_1 \cap A_2 \cap A_3) = \mathbb{P}(A_1)\mathbb{P}(A_2)\mathbb{P}(A_3)$, show that it is not necessarily true that the three $\sigma$-algebras generated by $A_1$, $A_2$, and $A_3$ are $\mathbb{P}$-independent.

Exercise 1.1.9. This exercise deals with three elementary, but important, properties of independent random variables. Throughout, $(\Omega, \mathcal{F}, \mathbb{P})$ is a given probability space.

(i) Let $X_1$ and $X_2$ be a pair of $\mathbb{P}$-independent random variables with values in the measurable spaces $(E_1, \mathcal{B}_1)$ and $(E_2, \mathcal{B}_2)$, respectively. Given a $\mathcal{B}_1 \times \mathcal{B}_2$-measurable function $F : E_1 \times E_2 \to \mathbb{R}$ that is bounded below, use Tonelli’s or Fubini’s Theorem to show that

$$x_2 \in E_2 \longmapsto f(x_2) \equiv \mathbb{E}^{\mathbb{P}}\bigl[F(X_1, x_2)\bigr] \in \mathbb{R}$$

is $\mathcal{B}_2$-measurable and that

$$\mathbb{E}^{\mathbb{P}}\bigl[F(X_1, X_2)\bigr] = \mathbb{E}^{\mathbb{P}}\bigl[f(X_2)\bigr].$$

(ii) Suppose that $X_1, \dots, X_n$ are $\mathbb{P}$-independent, real-valued random variables. If each of the $X_m$’s is $\mathbb{P}$-integrable, show that $X_1 \cdots X_n$ is also $\mathbb{P}$-integrable and that

$$\mathbb{E}^{\mathbb{P}}[X_1 \cdots X_n] = \mathbb{E}^{\mathbb{P}}[X_1] \cdots \mathbb{E}^{\mathbb{P}}[X_n].$$

(iii) Let $\{X_n : n \in \mathbb{Z}^+\}$ be a sequence of independent random variables taking values in some separable metric space $E$. If $\mathbb{P}(X_n = x) = 0$ for all $x \in E$ and $n \in \mathbb{Z}^+$, show that $\mathbb{P}(X_m = X_n \text{ for some } m \ne n) = 0$.
Exercise 1.1.10. As an application of Lemma 1.1.6 and part (ii) of Exercise 1.1.9, prove the identity

$$\sin z = z \prod_{n=1}^{\infty} \cos\bigl(2^{-n} z\bigr) \quad \text{for all } z \in \mathbb{C}.$$
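Before attempting the proof, it can be reassuring to check the identity numerically. The following Python sketch is only a sanity check under the assumption that truncating the infinite product after 60 factors is accurate to floating-point precision; the name `cosine_product` is hypothetical, and the computation is no substitute for the probabilistic argument the exercise asks for.

```python
import cmath


def cosine_product(z, N=60):
    """z times the product of cos(z / 2^n) over 1 <= n <= N (a truncation of the infinite product)."""
    prod = z
    for n in range(1, N + 1):
        prod *= cmath.cos(z / 2 ** n)
    return prod


for z in (0.5, 3.0, 1 + 2j):
    print(z, cosine_product(z), cmath.sin(z))
```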

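Exercise 1.1.11. Define $\{\epsilon_n(\omega) : n \ge 1\}$ for $\omega \in [0, 1)$ as in the proof of Lemma 1.1.6.

(i) Show that $\{\epsilon_n(\omega) : n \ge 1\}$ is the unique sequence $\{\alpha_n : n \ge 1\} \subseteq \{0, 1\}$ such that $0 \le \omega - \sum_{m=1}^{n} 2^{-m}\alpha_m < 2^{-n}$ for all $n \ge 1$, and conclude that $\epsilon_1(\omega) = \lfloor 2\omega \rfloor$ and $\epsilon_{n+1}(\omega) = \lfloor 2^{n+1}\omega \rfloor - 2\lfloor 2^n \omega \rfloor$ for $n \ge 1$.

(ii) Define $F : [0, 1) \to [0, 1)^2$ by

$$F(\omega) = \Bigl(\sum_{n=1}^{\infty} 2^{-n}\epsilon_{2n-1}(\omega),\ \sum_{n=1}^{\infty} 2^{-n}\epsilon_{2n}(\omega)\Bigr),$$

and show that $\lambda_{[0,1)^2} = F_*\lambda_{[0,1)}$. That is, $\lambda_{[0,1)}\bigl(\{\omega : F(\omega) \in \Gamma\}\bigr) = \lambda_{[0,1)}^2(\Gamma)$ for all $\Gamma \in \mathcal{B}_{[0,1)^2}$.

(iii) Define $G : [0, 1)^2 \to [0, 1)$ by

$$G\bigl((\omega_1, \omega_2)\bigr) = \sum_{n=1}^{\infty} \frac{2\epsilon_n(\omega_1) + \epsilon_n(\omega_2)}{4^n},$$

and show that $\lambda_{[0,1)} = G_*\lambda_{[0,1)^2}$.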

Parts (ii) and (iii) are special cases of a general principle that says, under 
very general circumstances, measures can be transformed into one another. 
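The maps $F$ and $G$ of parts (ii) and (iii) act on binary digits: $F$ sends the odd-indexed digits of $\omega$ to one coordinate and the even-indexed digits to the other, while $G$ interleaves the digits of its two arguments. The Python sketch below illustrates this on a dyadic rational, where truncating the sums after finitely many terms is exact. The names `split` and `interleave` are hypothetical, and the truncation level is an assumption made only to keep the example finite.

```python
from fractions import Fraction
from math import floor


def eps(n, omega):
    """n-th binary digit of omega in [0, 1)."""
    return floor(2 ** n * omega) - 2 * floor(2 ** (n - 1) * omega)


def split(omega, N):
    """Truncation of F: odd-indexed digits form the first coordinate, even-indexed the second."""
    first = sum(Fraction(eps(2 * n - 1, omega), 2 ** n) for n in range(1, N + 1))
    second = sum(Fraction(eps(2 * n, omega), 2 ** n) for n in range(1, N + 1))
    return first, second


def interleave(w1, w2, N):
    """Truncation of G: sum of (2 * eps_n(w1) + eps_n(w2)) / 4^n over 1 <= n <= N."""
    return sum(Fraction(2 * eps(n, w1) + eps(n, w2), 4 ** n) for n in range(1, N + 1))


omega = Fraction(173, 256)  # binary expansion 0.10101101
w1, w2 = split(omega, 4)
print(w1, w2, interleave(w1, w2, 4))
```

For this $\omega$ the interleaving undoes the splitting, which is the digit-level picture behind the two measure identities.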

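Exercise 1.1.12. Given a non-empty set $\Omega$, recall² that a collection $\mathcal{C}$ of subsets of $\Omega$ is called a $\pi$-system if $\mathcal{C}$ is closed under finite intersections. At the same time, recall that a collection $\mathcal{L}$ is called a $\lambda$-system if $\Omega \in \mathcal{L}$, $A \cup B \in \mathcal{L}$ whenever $A$ and $B$ are disjoint members of $\mathcal{L}$, $B \setminus A \in \mathcal{L}$ whenever $A$ and $B$ are members of $\mathcal{L}$ with $A \subseteq B$, and $\bigcup_{1}^{\infty} A_n \in \mathcal{L}$ whenever $\{A_n : n \ge 1\}$ is a non-decreasing sequence of members of $\mathcal{L}$. Finally, recall (cf. Lemma 3.1.3 in my Concise Introduction to the Theory of Integration) that if $\mathcal{C}$ is a $\pi$-system, then the $\sigma$-algebra $\sigma(\mathcal{C})$ generated by $\mathcal{C}$ is the smallest $\lambda$-system $\mathcal{L} \supseteq \mathcal{C}$.

Show that if $\mathcal{C}$ is a $\pi$-system and $\mathcal{F} = \sigma(\mathcal{C})$, then two probability measures $\mathbb{P}$ and $\mathbb{Q}$ are equal on $\mathcal{F}$ if they are equal on $\mathcal{C}$. Next use this to see that if $\{\mathcal{C}_i : i \in \mathcal{I}\}$ is a family of $\pi$-systems contained in $\mathcal{F}$ and if (1.1.1) holds when the $A_i$’s are from the $\mathcal{C}_i$’s, then the family of $\sigma$-algebras $\{\sigma(\mathcal{C}_i) : i \in \mathcal{I}\}$ is independent.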

2 See, for example, §3.1 in the author’s A Concise Introduction to the Theory of Integration, 
Third Edition, Birkhauser (1998). 
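Exercise 1.1.13. In this exercise I discuss two criteria for determining when random variables on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$ are independent.

(i) Let $X_1, \dots, X_n$ be bounded, real-valued random variables. Using Weierstrass’s Approximation Theorem, show that the $X_m$’s are $\mathbb{P}$-independent if and only if

$$\mathbb{E}^{\mathbb{P}}\bigl[X_1^{m_1} \cdots X_n^{m_n}\bigr] = \mathbb{E}^{\mathbb{P}}\bigl[X_1^{m_1}\bigr] \cdots \mathbb{E}^{\mathbb{P}}\bigl[X_n^{m_n}\bigr]$$

for all $m_1, \dots, m_n \in \mathbb{N}$.

(ii) Let $X : \Omega \to \mathbb{R}^m$ and $Y : \Omega \to \mathbb{R}^n$ be random variables. Show that $X$ and $Y$ are $\mathbb{P}$-independent if and only if

$$\mathbb{E}^{\mathbb{P}}\Bigl[\exp\Bigl(\sqrt{-1}\bigl((\alpha, X)_{\mathbb{R}^m} + (\beta, Y)_{\mathbb{R}^n}\bigr)\Bigr)\Bigr] = \mathbb{E}^{\mathbb{P}}\Bigl[\exp\Bigl(\sqrt{-1}\,(\alpha, X)_{\mathbb{R}^m}\Bigr)\Bigr]\, \mathbb{E}^{\mathbb{P}}\Bigl[\exp\Bigl(\sqrt{-1}\,(\beta, Y)_{\mathbb{R}^n}\Bigr)\Bigr]$$

for all $\alpha \in \mathbb{R}^m$ and $\beta \in \mathbb{R}^n$.

Hint: The only if assertion is obvious. To prove the if assertion, first check that $X$ and $Y$ are independent if

$$\mathbb{E}^{\mathbb{P}}\bigl[f(X)\, g(Y)\bigr] = \mathbb{E}^{\mathbb{P}}\bigl[f(X)\bigr]\, \mathbb{E}^{\mathbb{P}}\bigl[g(Y)\bigr]$$

for all $f \in C_c^{\infty}(\mathbb{R}^m; \mathbb{C})$ and $g \in C_c^{\infty}(\mathbb{R}^n; \mathbb{C})$. Second, given such $f$ and $g$, apply elementary Fourier analysis to write

$$f(x) = \int_{\mathbb{R}^m} e^{\sqrt{-1}\,(\alpha, x)_{\mathbb{R}^m}} \varphi(\alpha)\, d\alpha \quad \text{and} \quad g(y) = \int_{\mathbb{R}^n} e^{\sqrt{-1}\,(\beta, y)_{\mathbb{R}^n}} \psi(\beta)\, d\beta,$$

where $\varphi$ and $\psi$ are smooth functions with rapidly decreasing (i.e., tending to 0 as $|x| \to \infty$ faster than any power of $(1 + |x|)^{-1}$) derivatives of all orders. Finally, apply Fubini’s Theorem.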

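Exercise 1.1.14. Given a pair of measurable spaces $(E_1, \mathcal{B}_1)$ and $(E_2, \mathcal{B}_2)$, recall that their product is the measurable space $(E_1 \times E_2, \mathcal{B}_1 \times \mathcal{B}_2)$, where $\mathcal{B}_1 \times \mathcal{B}_2$ is the $\sigma$-algebra over the Cartesian product space $E_1 \times E_2$ generated by the sets $\Gamma_1 \times \Gamma_2$, $\Gamma_i \in \mathcal{B}_i$. Further, recall that, for any probability measures $\mu_i$ on $(E_i, \mathcal{B}_i)$, there is a unique probability measure $\mu_1 \times \mu_2$ on $(E_1 \times E_2, \mathcal{B}_1 \times \mathcal{B}_2)$ such that

$$(\mu_1 \times \mu_2)(\Gamma_1 \times \Gamma_2) = \mu_1(\Gamma_1)\,\mu_2(\Gamma_2) \quad \text{for } \Gamma_i \in \mathcal{B}_i.$$

More generally, for any $n \ge 2$ and measurable spaces $\{(E_i, \mathcal{B}_i) : 1 \le i \le n\}$, one takes $\prod_1^n \mathcal{B}_i$ to be the $\sigma$-algebra over $\prod_1^n E_i$ generated by the sets $\prod_1^n \Gamma_i$, $\Gamma_i \in \mathcal{B}_i$. In particular, since $\prod_1^{n+1} E_i$ and $\prod_1^{n+1} \mathcal{B}_i$ can be identified with $\bigl(\prod_1^n E_i\bigr) \times E_{n+1}$ and $\bigl(\prod_1^n \mathcal{B}_i\bigr) \times \mathcal{B}_{n+1}$, respectively, one can use induction to show that, for every choice of probability measures $\mu_i$ on $(E_i, \mathcal{B}_i)$, there is a unique probability measure $\prod_1^n \mu_i$ on $\bigl(\prod_1^n E_i, \prod_1^n \mathcal{B}_i\bigr)$ such that

$$\Bigl(\prod_1^n \mu_i\Bigr)\Bigl(\prod_1^n \Gamma_i\Bigr) = \prod_1^n \mu_i(\Gamma_i), \quad \Gamma_i \in \mathcal{B}_i.$$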

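The purpose of this exercise is to generalize the preceding construction to infinite collections. Thus, let $\mathcal{I}$ be an infinite index set, and, for each $i \in \mathcal{I}$, let $(E_i, \mathcal{B}_i)$ be a measurable space. Given $\emptyset \ne \Lambda \subseteq \mathcal{I}$, use $E_\Lambda$ to denote the Cartesian product space $\prod_{i \in \Lambda} E_i$ and $\pi_\Lambda$ to denote the natural projection map taking $E_{\mathcal{I}}$ onto $E_\Lambda$. Further, let $\mathcal{B}_{\mathcal{I}} = \prod_{i \in \mathcal{I}} \mathcal{B}_i$ stand for the $\sigma$-algebra over $E_{\mathcal{I}}$ generated by the collection $\mathcal{C}$ of subsets

$$\pi_F^{-1}\Bigl(\prod_{i \in F} \Gamma_i\Bigr), \quad \Gamma_i \in \mathcal{B}_i,$$

as $F$ varies over non-empty, finite subsets of $\mathcal{I}$ (abbreviated by $\emptyset \ne F \subset\subset \mathcal{I}$). In the following steps, I outline a proof that, for every choice of probability measures $\mu_i$ on the $(E_i, \mathcal{B}_i)$’s, there is a unique probability measure $\prod_{i \in \mathcal{I}} \mu_i$ on $(E_{\mathcal{I}}, \mathcal{B}_{\mathcal{I}})$ with the property that

$$(1.1.15) \qquad \Bigl(\prod_{i \in \mathcal{I}} \mu_i\Bigr)\Bigl(\pi_F^{-1}\Bigl(\prod_{i \in F} \Gamma_i\Bigr)\Bigr) = \prod_{i \in F} \mu_i(\Gamma_i), \quad \Gamma_i \in \mathcal{B}_i,$$

for every $\emptyset \ne F \subset\subset \mathcal{I}$. Not surprisingly, the probability space

$$\Bigl(\prod_{i \in \mathcal{I}} E_i,\ \prod_{i \in \mathcal{I}} \mathcal{B}_i,\ \prod_{i \in \mathcal{I}} \mu_i\Bigr)$$

is called the product over $\mathcal{I}$ of the spaces $(E_i, \mathcal{B}_i, \mu_i)$; and when all the factors are the same space $(E, \mathcal{B}, \mu)$, it is customary to denote it by $(E^{\mathcal{I}}, \mathcal{B}^{\mathcal{I}}, \mu^{\mathcal{I}})$, and if, in addition, $\mathcal{I} = \{1, \dots, N\}$, one uses $(E^N, \mathcal{B}^N, \mu^N)$.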

(i) After noting (cf. Exercise 1.1.12) that two probability measures that agree on a $\pi$-system agree on the $\sigma$-algebra generated by that $\pi$-system, show that there is at most one probability measure on $(E_{\mathcal{I}}, \mathcal{B}_{\mathcal{I}})$ that satisfies the condition in (1.1.15). Hence, the problem is purely one of existence.

(ii) Let $\mathcal{A}$ be the algebra over $E_{\mathcal{I}}$ generated by $\mathcal{C}$, and show that there is a finitely additive $\mu : \mathcal{A} \to [0, 1]$ with the property that

$$\mu\bigl(\pi_F^{-1}(\Gamma_F)\bigr) = \Bigl(\prod_{i \in F} \mu_i\Bigr)(\Gamma_F), \quad \Gamma_F \in \mathcal{B}_F,$$

for all $\emptyset \ne F \subset\subset \mathcal{I}$. Hence, all that one has to do is check that $\mu$ admits a $\sigma$-additive extension to $\mathcal{B}_{\mathcal{I}}$, and, by a standard extension theorem, this comes down to checking that $\mu(A_n) \searrow 0$ whenever $\{A_n : n \ge 1\} \subseteq \mathcal{A}$ and $A_n \searrow \emptyset$. Thus, let $\{A_n : n \ge 1\}$ be a non-increasing sequence from $\mathcal{A}$, and assume that $\mu(A_n) \ge \epsilon$ for some $\epsilon > 0$ and all $n \in \mathbb{Z}^+$. One must show that $\bigcap_1^{\infty} A_n \ne \emptyset$.

(iii) Referring to the last part of (ii), show that there is no loss in generality to assume that $A_n = \pi_{F_n}^{-1}(\Gamma_{F_n})$, where, for each $n \in \mathbb{Z}^+$, $\emptyset \ne F_n \subset\subset \mathcal{I}$ and $\Gamma_{F_n} \in \mathcal{B}_{F_n}$. In addition, show that one may assume that $F_1 = \{i_1\}$ and that $F_n = F_{n-1} \cup \{i_n\}$, $n \ge 2$, where $\{i_n : n \ge 1\}$ is a sequence of distinct elements of $\mathcal{I}$. Now, make these assumptions, and show that it suffices to find $a_\ell \in E_{i_\ell}$, $\ell \in \mathbb{Z}^+$, with the property that, for each $m \in \mathbb{Z}^+$, $(a_1, \dots, a_m) \in \Gamma_{F_m}$.

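(iv) Continuing (iii), for each $m, n \in \mathbb{Z}^+$, define $g_{m,n} : E_{F_m} \to [0, 1]$ so that

$$g_{m,n}(x_{F_m}) = \mathbf{1}_{\Gamma_{F_n}}(x_{i_1}, \dots, x_{i_n}) \quad \text{if } n \le m$$

and

$$g_{m,n}(x_{F_m}) = \int_{E_{F_n \setminus F_m}} \mathbf{1}_{\Gamma_{F_n}}\bigl(x_{F_m}, y_{F_n \setminus F_m}\bigr) \Bigl(\prod_{\ell = m+1}^{n} \mu_{i_\ell}\Bigr)\bigl(dy_{F_n \setminus F_m}\bigr) \quad \text{if } n > m.$$

After noting that, for each $m$ and $n$, $g_{m,n+1} \le g_{m,n}$ and

$$g_{m,n}(x_{F_m}) = \int_{E_{i_{m+1}}} g_{m+1,n}\bigl(x_{F_m}, y_{i_{m+1}}\bigr)\, \mu_{i_{m+1}}\bigl(dy_{i_{m+1}}\bigr),$$

set $g_m = \lim_{n \to \infty} g_{m,n}$ and conclude that

$$g_m(x_{F_m}) = \int_{E_{i_{m+1}}} g_{m+1}\bigl(x_{F_m}, y_{i_{m+1}}\bigr)\, \mu_{i_{m+1}}\bigl(dy_{i_{m+1}}\bigr).$$

In addition, note that

$$\int_{E_{i_1}} g_1(x_{i_1})\, \mu_{i_1}(dx_{i_1}) = \lim_{n \to \infty} \int_{E_{i_1}} g_{1,n}(x_{i_1})\, \mu_{i_1}(dx_{i_1}) = \lim_{n \to \infty} \mu(A_n) \ge \epsilon,$$

and proceed by induction to produce $a_\ell \in E_{i_\ell}$, $\ell \in \mathbb{Z}^+$, so that

$$g_m\bigl((a_1, \dots, a_m)\bigr) \ge \epsilon \quad \text{for all } m \in \mathbb{Z}^+.$$

Finally, check that $\{a_m : m \ge 1\}$ is a sequence of the sort for which we were looking at the end of part (iii).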
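Exercise 1.1.16. Recall that if $\Phi$ is a measurable map from one measurable space $(E, \mathcal{B})$ into a second one $(E', \mathcal{B}')$, then the distribution of $\Phi$ under a measure $\mu$ on $(E, \mathcal{B})$ is the pushforward measure $\Phi_*\mu$ (also denoted by $\mu \circ \Phi^{-1}$) defined on $(E', \mathcal{B}')$ by

$$\Phi_*\mu(\Gamma) = \mu\bigl(\Phi^{-1}(\Gamma)\bigr) \quad \text{for } \Gamma \in \mathcal{B}'.$$

Given a non-empty index set $\mathcal{I}$ and, for each $i \in \mathcal{I}$, a measurable space $(E_i, \mathcal{B}_i)$ and an $E_i$-valued random variable $X_i$ on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$, define $X : \Omega \to \prod_{i \in \mathcal{I}} E_i$ so that $X(\omega)_i = X_i(\omega)$ for each $i \in \mathcal{I}$ and $\omega \in \Omega$. Show that $\{X_i : i \in \mathcal{I}\}$ is a family of $\mathbb{P}$-independent random variables if and only if $X_*\mathbb{P} = \prod_{i \in \mathcal{I}} (X_i)_*\mathbb{P}$. In particular, given probability measures $\mu_i$ on $(E_i, \mathcal{B}_i)$, set

$$\Omega = \prod_{i \in \mathcal{I}} E_i, \quad \mathcal{F} = \prod_{i \in \mathcal{I}} \mathcal{B}_i, \quad \mathbb{P} = \prod_{i \in \mathcal{I}} \mu_i,$$

let $X_i : \Omega \to E_i$ be the natural projection map from $\Omega$ onto $E_i$, and show that $\{X_i : i \in \mathcal{I}\}$ is a family of mutually $\mathbb{P}$-independent random variables such that, for each $i \in \mathcal{I}$, $X_i$ has distribution $\mu_i$.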

Exercise 1.1.17. Although it does not entail infinite product spaces, an interesting example of the way in which the preceding type of construction can be effectively applied is provided by the following elementary version of a coupling argument.

(i) Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $X$ and $Y$ a pair of $\mathbb{P}$-square integrable $\mathbb{R}$-valued random variables with the property that

$$\bigl(X(\omega) - X(\omega')\bigr)\bigl(Y(\omega) - Y(\omega')\bigr) \ge 0 \quad \text{for all } (\omega, \omega') \in \Omega^2.$$

Show that

$$\mathbb{E}^{\mathbb{P}}[XY] \ge \mathbb{E}^{\mathbb{P}}[X]\, \mathbb{E}^{\mathbb{P}}[Y].$$

Hint: Define $X_i$ and $Y_i$ on $\Omega^2$ for $i \in \{1, 2\}$ so that $X_i(\omega) = X(\omega_i)$ and $Y_i(\omega) = Y(\omega_i)$ when $\omega = (\omega_1, \omega_2)$, and integrate the inequality

$$0 \le \bigl(X(\omega_1) - X(\omega_2)\bigr)\bigl(Y(\omega_1) - Y(\omega_2)\bigr) = \bigl(X_1(\omega) - X_2(\omega)\bigr)\bigl(Y_1(\omega) - Y_2(\omega)\bigr)$$

with respect to $\mathbb{P}^2$.

(ii) Suppose that $n \in \mathbb{Z}^+$ and that $f$ and $g$ are $\mathbb{R}$-valued, Borel measurable functions on $\mathbb{R}^n$ that are non-decreasing with respect to each coordinate (separately). Show that if $X = (X_1, \dots, X_n)$ is an $\mathbb{R}^n$-valued random variable on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ whose coordinates are mutually $\mathbb{P}$-independent, then

$$\mathbb{E}^{\mathbb{P}}\bigl[f(X)\, g(X)\bigr] \ge \mathbb{E}^{\mathbb{P}}\bigl[f(X)\bigr]\, \mathbb{E}^{\mathbb{P}}\bigl[g(X)\bigr]$$

so long as $f(X)$ and $g(X)$ are both $\mathbb{P}$-square integrable.

Hint: First check that the case when $n = 1$ reduces to an application of (i). Next, describe the general case in terms of a multiple integral, apply Fubini’s Theorem, and make repeated use of the case when $n = 1$.

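Exercise 1.1.18. A $\sigma$-algebra is said to be countably generated if it contains a countable collection of sets that generate it. The purpose of this exercise is to show that just because a $\sigma$-algebra is itself countably generated does not mean that all its sub-$\sigma$-algebras are.

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $\{A_n : n \in \mathbb{Z}^+\} \subseteq \mathcal{F}$ a sequence of $\mathbb{P}$-independent sets with the property that $\alpha \le \mathbb{P}(A_n) \le 1 - \alpha$ for some $\alpha \in (0, 1)$. Let $\mathcal{F}_n$ be the sub-$\sigma$-algebra generated by $A_n$. Show that the tail $\sigma$-algebra $\mathcal{T}$ determined by $\{\mathcal{F}_n : n \in \mathbb{Z}^+\}$ cannot be countably generated.

Hint: Show that $C \in \mathcal{T}$ is an atom in $\mathcal{T}$ (i.e., $B = C$ whenever $B \in \mathcal{T} \setminus \{\emptyset\}$ is contained in $C$) only if one can write

$$C = \varliminf_{n \to \infty} C_n \equiv \bigcup_{m=1}^{\infty} \bigcap_{n \ge m} C_n,$$

where, for each $n \in \mathbb{Z}^+$, $C_n$ equals either $A_n$ or $A_n^{\complement}$. Conclude that every atom in $\mathcal{T}$ must have $\mathbb{P}$-measure 0. Now suppose that $\mathcal{T}$ were generated by $\{B_\ell : \ell \in \mathbb{N}\}$. By Kolmogorov’s 0–1 Law, $\mathbb{P}(B_\ell) \in \{0, 1\}$ for every $\ell \in \mathbb{N}$. Take

$$\hat{C}_\ell = \begin{cases} B_\ell & \text{if } \mathbb{P}(B_\ell) = 1 \\ B_\ell^{\complement} & \text{if } \mathbb{P}(B_\ell) = 0 \end{cases} \quad \text{and set} \quad C = \bigcap_{\ell \in \mathbb{N}} \hat{C}_\ell.$$

Note that, on the one hand, $\mathbb{P}(C) = 1$, while, on the other hand, $C$ is an atom in $\mathcal{T}$ and therefore has probability 0.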

Exercise 1.1.19. Here is an interesting application of Kolmogorov’s 0-1 Law 
to a property of the real numbers. 

(i) Referring to the discussion preceding Lemma 1.1.6 and part (i) of Exercise 1.1.11, define the transformations $T_n : [0, 1) \to [0, 1)$ for $n \in \mathbb{Z}^+$ so that

$$T_n(\omega) = \omega - \frac{R_n(\omega)}{2^n}, \quad \omega \in [0, 1),$$

and notice (cf. the proof of Lemma 1.1.6) that $T_n(\omega)$ simply flips the $n$th coefficient in the binary expansion of $\omega$. Next, let $\Gamma \in \mathcal{B}_{[0,1)}$, and show that $\Gamma$ is measurable with respect to the $\sigma$-algebra $\sigma(\{R_n : n > m\})$ generated by $\{R_n : n > m\}$ if and only if $T_n(\Gamma) = \Gamma$ for each $1 \le n \le m$. In particular, conclude that $\lambda_{[0,1)}(\Gamma) \in \{0, 1\}$ if $T_n\Gamma = \Gamma$ for every $n \in \mathbb{Z}^+$.

(ii) Let $\mathfrak{F}$ denote the set of all finite subsets of $\mathbb{Z}^+$, and for each $F \in \mathfrak{F}$, define $T^F : [0, 1) \to [0, 1)$ so that $T^{\emptyset}$ is the identity mapping and

$$T^{F \cup \{m\}} = T^F \circ T_m \quad \text{for each } F \in \mathfrak{F} \text{ and } m \in \mathbb{Z}^+ \setminus F.$$

As an application of (i), show that for every $\Gamma \in \mathcal{B}_{[0,1)}$ with $\lambda_{[0,1)}(\Gamma) > 0$,

$$\lambda_{[0,1)}\Bigl(\bigcup_{F \in \mathfrak{F}} T^F(\Gamma)\Bigr) = 1.$$

In particular, this means that if $\Gamma$ has positive measure, then almost every $\omega \in [0, 1)$ can be moved to $\Gamma$ by flipping a finite number of the coefficients in the binary expansion of $\omega$.

§ 1.2 The Weak Law of Large Numbers 

Starting with this section, and for the rest of this chapter, I will be studying what 
happens when one averages independent, real-valued random variables. The 
remarkable fact, which will be confirmed repeatedly, is that the limiting behavior 
of such averages depends hardly at all on the variables involved. Intuitively, 
one can explain this phenomenon by pretending that the random variables are 
building blocks that, in the averaging process, first get homothetically shrunk 
and then reassembled according to a regular pattern. Hence, by the time that 
one passes to the limit, the peculiarities of the original blocks get lost. 

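Throughout the discussion, $(\Omega, \mathcal{F}, \mathbb{P})$ will be a probability space on which there is a sequence $\{X_n : n \ge 1\}$ of real-valued random variables. Given $n \in \mathbb{Z}^+$, use $S_n$ to denote the partial sum $X_1 + \cdots + X_n$ and $\overline{S}_n$ to denote the average:

$$\overline{S}_n \equiv \frac{S_n}{n} = \frac{1}{n} \sum_{\ell=1}^{n} X_\ell.$$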

§ 1.2.1. Orthogonal Random Variables. My first result is a very general 
one; in fact, it even applies to random variables that are not necessarily inde- 
pendent and do not necessarily have mean 0. 

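Lemma 1.2.1. Assume that

$$\mathbb{E}^{\mathbb{P}}\bigl[X_n^2\bigr] < \infty \text{ for } n \in \mathbb{Z}^+ \quad \text{and} \quad \mathbb{E}^{\mathbb{P}}\bigl[X_k X_\ell\bigr] = 0 \text{ if } k \ne \ell.$$

Then, for each $\epsilon > 0$,

$$(1.2.2) \qquad \epsilon^2\, \mathbb{P}\bigl(|\overline{S}_n| \ge \epsilon\bigr) \le \mathbb{E}^{\mathbb{P}}\bigl[\overline{S}_n^2\bigr] = \frac{1}{n^2} \sum_{\ell=1}^{n} \mathbb{E}^{\mathbb{P}}\bigl[X_\ell^2\bigr] \quad \text{for } n \in \mathbb{Z}^+.$$

In particular, if

$$M \equiv \sup_{n \in \mathbb{Z}^+} \mathbb{E}^{\mathbb{P}}\bigl[X_n^2\bigr] < \infty,$$

then

$$(1.2.3) \qquad \epsilon^2\, \mathbb{P}\bigl(|\overline{S}_n| \ge \epsilon\bigr) \le \mathbb{E}^{\mathbb{P}}\bigl[\overline{S}_n^2\bigr] \le \frac{M}{n}, \quad n \in \mathbb{Z}^+ \text{ and } \epsilon > 0;$$

and so $\overline{S}_n \to 0$ in $L^2(\mathbb{P}; \mathbb{R})$ and therefore also in $\mathbb{P}$-probability.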
PROOF: To prove the equality in (1.2.2), note that, by orthogonality,

$$\mathbb{E}^{\mathbb{P}}\bigl[S_n^2\bigr] = \sum_{\ell=1}^{n} \mathbb{E}^{\mathbb{P}}\bigl[X_\ell^2\bigr].$$

The rest is just an application of Chebyshev’s inequality, the estimate that results after integrating the inequality

$$\epsilon^2\, \mathbf{1}_{[\epsilon, \infty)}(|Y|) \le Y^2\, \mathbf{1}_{[\epsilon, \infty)}(|Y|) \le Y^2$$

for any random variable $Y$. □

§1.2.2. Independent Random Variables. Although Lemma 1.2.1 does not use independence, independent random variables provide a ready source of orthogonal functions. To wit, recall that for any $\mathbb{P}$-square integrable random variable $X$, its variance $\operatorname{Var}(X)$ satisfies

$$\operatorname{Var}(X) \equiv \mathbb{E}^{\mathbb{P}}\Bigl[\bigl(X - \mathbb{E}^{\mathbb{P}}[X]\bigr)^2\Bigr] = \mathbb{E}^{\mathbb{P}}\bigl[X^2\bigr] - \bigl(\mathbb{E}^{\mathbb{P}}[X]\bigr)^2 \le \mathbb{E}^{\mathbb{P}}\bigl[X^2\bigr].$$

In particular, if the random variables $X_n$, $n \in \mathbb{Z}^+$, are $\mathbb{P}$-square integrable and $\mathbb{P}$-independent, then the random variables

$$\hat{X}_n \equiv X_n - \mathbb{E}^{\mathbb{P}}[X_n], \quad n \in \mathbb{Z}^+,$$

are still $\mathbb{P}$-square integrable, have mean value 0, and therefore are orthogonal. Hence, the following statement is an immediate consequence of Lemma 1.2.1.

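Theorem 1.2.4. Let $\{X_n : n \in \mathbb{Z}^+\}$ be a sequence of $\mathbb{P}$-independent, $\mathbb{P}$-square integrable random variables with mean value $m$ and variance dominated by $\sigma^2$. Then, for every $n \in \mathbb{Z}^+$ and $\epsilon > 0$,

$$(1.2.5) \qquad \epsilon^2\, \mathbb{P}\bigl(|\overline{S}_n - m| \ge \epsilon\bigr) \le \mathbb{E}^{\mathbb{P}}\bigl[(\overline{S}_n - m)^2\bigr] \le \frac{\sigma^2}{n}.$$

In particular, $\overline{S}_n \to m$ in $L^2(\mathbb{P}; \mathbb{R})$ and therefore in $\mathbb{P}$-probability.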
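For Bernoulli random variables the left-hand side of (1.2.5) can be computed exactly, which gives a feel for how conservative the estimate is. The Python sketch below is an illustration only: the function names and the parameter choices $p = \tfrac{3}{10}$, $\epsilon = \tfrac{1}{10}$ are assumptions, and the exact tail is obtained from the binomial distribution of $S_n$.

```python
from fractions import Fraction
from math import comb


def tail_probability(n, p, eps):
    """Exact P(|S_n / n - p| >= eps) when S_n is Binomial(n, p)."""
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(Fraction(k, n) - p) >= eps
    )


def chebyshev_bound(n, p, eps):
    """sigma^2 / (n * eps^2) with sigma^2 = p * (1 - p), i.e. (1.2.5) divided by eps^2."""
    return p * (1 - p) / (n * eps ** 2)


p, eps = Fraction(3, 10), Fraction(1, 10)
for n in (10, 100, 400):
    print(n, float(tail_probability(n, p, eps)), float(chebyshev_bound(n, p, eps)))
```

In this example the exact tail falls off much faster than the $1/n$ rate guaranteed by the bound.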

As yet I have made only minimal use of independence: all that I have done 
is subtract off the mean of independent random variables and thereby made 
them orthogonal. In order to bring the full force of independence into play, one 
has to exploit the fact that one can compose independent random variables with 
any (measurable) functions without destroying their independence; in particular, 
truncating independent random variables does not destroy independence. To see 
how such a property can be brought to bear, I will now consider the problem 
of extending the last part of Theorem 1.2.4 to $X_n$’s that are less than $\mathbb{P}$-square integrable. In order to understand the statement, recall that a family of random variables $\{X_i : i \in \mathcal{I}\}$ is said to be uniformly $\mathbb{P}$-integrable if

$$\lim_{R \nearrow \infty} \sup_{i \in \mathcal{I}} \mathbb{E}^{\mathbb{P}}\bigl[|X_i|,\ |X_i| \ge R\bigr] = 0.$$

As the proof of the following theorem illustrates, the importance of this condition is that it allows one to simultaneously approximate the random variables $X_i$, $i \in \mathcal{I}$, by bounded random variables.
Theorem 1.2.6 (The Weak Law of Large Numbers). Let $\{X_n : n \in \mathbb{Z}^+\}$ be a uniformly $\mathbb{P}$-integrable sequence of $\mathbb{P}$-independent random variables. Then

$$\frac{1}{n} \sum_{m=1}^{n} \bigl(X_m - \mathbb{E}^{\mathbb{P}}[X_m]\bigr) \longrightarrow 0 \quad \text{in } L^1(\mathbb{P}; \mathbb{R})$$

and therefore also in $\mathbb{P}$-probability. In particular, if $\{X_n : n \in \mathbb{Z}^+\}$ is a sequence of $\mathbb{P}$-independent, $\mathbb{P}$-integrable random variables that are identically distributed, then $\overline{S}_n \to \mathbb{E}^{\mathbb{P}}[X_1]$ in $L^1(\mathbb{P}; \mathbb{R})$ and $\mathbb{P}$-probability. (Cf. Exercise 1.2.11.)

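PROOF: Without loss in generality, I will assume that $\mathbb{E}^{\mathbb{P}}[X_n] = 0$ for every $n \in \mathbb{Z}^+$.

For each $R \in (0, \infty)$, define $f_R(t) = t\, \mathbf{1}_{[-R, R]}(t)$, $t \in \mathbb{R}$,

$$m_n^{(R)} = \mathbb{E}^{\mathbb{P}}\bigl[f_R \circ X_n\bigr], \quad X_n^{(R)} = f_R \circ X_n - m_n^{(R)}, \quad \text{and} \quad Y_n^{(R)} = X_n - X_n^{(R)},$$

and set

$$\overline{S}_n^{(R)} = \frac{1}{n} \sum_{\ell=1}^{n} X_\ell^{(R)} \quad \text{and} \quad \overline{T}_n^{(R)} = \frac{1}{n} \sum_{\ell=1}^{n} Y_\ell^{(R)}.$$

Since $\mathbb{E}^{\mathbb{P}}[X_n] = 0 \implies m_n^{(R)} = -\mathbb{E}^{\mathbb{P}}\bigl[X_n,\ |X_n| > R\bigr]$,

$$\begin{aligned}
\mathbb{E}^{\mathbb{P}}\bigl[|\overline{S}_n|\bigr] &\le \mathbb{E}^{\mathbb{P}}\bigl[|\overline{S}_n^{(R)}|\bigr] + \mathbb{E}^{\mathbb{P}}\bigl[|\overline{T}_n^{(R)}|\bigr] \\
&\le \mathbb{E}^{\mathbb{P}}\bigl[|\overline{S}_n^{(R)}|^2\bigr]^{\frac12} + 2 \max_{1 \le \ell \le n} \mathbb{E}^{\mathbb{P}}\bigl[|X_\ell|,\ |X_\ell| \ge R\bigr] \\
&\le \frac{R}{\sqrt{n}} + 2 \sup_{\ell \in \mathbb{Z}^+} \mathbb{E}^{\mathbb{P}}\bigl[|X_\ell|,\ |X_\ell| \ge R\bigr];
\end{aligned}$$

and therefore, for each $R > 0$,

$$\varlimsup_{n \to \infty} \mathbb{E}^{\mathbb{P}}\bigl[|\overline{S}_n|\bigr] \le 2 \sup_{\ell \in \mathbb{Z}^+} \mathbb{E}^{\mathbb{P}}\bigl[|X_\ell|,\ |X_\ell| \ge R\bigr].$$

Hence, because the $X_\ell$’s are uniformly $\mathbb{P}$-integrable, we get the desired convergence in $L^1(\mathbb{P}; \mathbb{R})$ by letting $R \nearrow \infty$. □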

§ 1.2.3. Approximate Identities. The name of Theorem 1.2.6 comes from 
a somewhat invidious comparison with the result in Theorem 1.4.9. The reason 
why the appellation weak is not entirely fair is that, although The Weak Law 
is indeed less refined than the result in Theorem 1.4.9, it is every bit as useful 
as the one in Theorem 1.4.9 and maybe even more important when it comes 
to applications. What The Weak Law provides is a ubiquitous technique for 
constructing an approximate identity (i.e., a sequence of measures that approximate a point mass) and measuring how fast the approximation is taking place. To illustrate how clever selections of the random variables entering The
Weak Law can lead to interesting applications, I will spend the rest of this section 
discussing S. Bernstein’s approach to Weierstrass’s Approximation Theorem. 

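For a given $p \in [0, 1]$, let $\{X_n : n \in \mathbb{Z}^+\}$ be a sequence of $\mathbb{P}$-independent $\{0, 1\}$-valued Bernoulli random variables with mean value $p$. Then

$$\mathbb{P}(S_n = \ell) = \binom{n}{\ell} p^\ell (1 - p)^{n - \ell} \quad \text{for } 0 \le \ell \le n.$$

Hence, for any $f \in C([0, 1]; \mathbb{R})$, the $n$th Bernstein polynomial

$$(1.2.7) \qquad B_n(p; f) \equiv \sum_{\ell=0}^{n} \binom{n}{\ell} f\Bigl(\frac{\ell}{n}\Bigr) p^\ell (1 - p)^{n - \ell}$$

of $f$ at $p$ is equal to

$$\mathbb{E}^{\mathbb{P}}\bigl[f \circ \overline{S}_n\bigr].$$

In particular,

$$\begin{aligned}
\bigl|f(p) - B_n(p; f)\bigr| &= \bigl|\mathbb{E}^{\mathbb{P}}\bigl[f(p) - f \circ \overline{S}_n\bigr]\bigr| \le \mathbb{E}^{\mathbb{P}}\bigl[|f(p) - f \circ \overline{S}_n|\bigr] \\
&\le 2\|f\|_{\mathrm{u}}\, \mathbb{P}\bigl(|\overline{S}_n - p| \ge \epsilon\bigr) + \rho(\epsilon; f),
\end{aligned}$$

where $\|f\|_{\mathrm{u}}$ is the uniform norm of $f$ (i.e., the supremum of $|f|$ over the domain of $f$) and

$$\rho(\epsilon; f) \equiv \sup\bigl\{|f(t) - f(s)| : 0 \le s < t \le 1 \text{ with } t - s \le \epsilon\bigr\}$$

is the modulus of continuity of $f$. Noting that $\operatorname{Var}(X_n) = p(1 - p) \le \tfrac14$ and applying (1.2.5), we conclude that, for every $\epsilon > 0$,

$$\bigl\|f(p) - B_n(p; f)\bigr\|_{\mathrm{u}} \le \frac{\|f\|_{\mathrm{u}}}{2n\epsilon^2} + \rho(\epsilon; f).$$

In other words, for all $n \in \mathbb{Z}^+$,

$$(1.2.8) \qquad \bigl\|f - B_n(\,\cdot\,; f)\bigr\|_{\mathrm{u}} \le \beta(n; f) \equiv \inf\Bigl\{\frac{\|f\|_{\mathrm{u}}}{2n\epsilon^2} + \rho(\epsilon; f) : \epsilon > 0\Bigr\}.$$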

Obviously, (1.2.8) not only shows that, as $n \to \infty$, $B_n(\,\cdot\,; f) \to f$ uniformly on $[0, 1]$, it even provides a rate of convergence in terms of the modulus of continuity of $f$. Thus, we have done more than simply prove Weierstrass’s theorem; we have produced a rather explicit and tractable sequence of approximating polynomials, the sequence $\{B_n(\,\cdot\,; f) : n \in \mathbb{Z}^+\}$. Although this sequence is, by no means, the most efficient one,¹ as we are about to see, the Bernstein polynomials have a lot to recommend them. In particular, they have the feature that they provide non-negative polynomial approximates to non-negative functions. In fact, the following discussion reveals much deeper non-negativity preservation properties possessed by the Bernstein approximation scheme.
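The estimate (1.2.8) is easy to examine numerically. The following Python sketch evaluates (1.2.7) directly for the test function $f(x) = |x - \tfrac12|$, for which $\|f\|_{\mathrm{u}} = \tfrac12$ and $\rho(\epsilon; f) = \min(\epsilon, \tfrac12)$, and compares the observed uniform error on a grid with the quantity inside the infimum in (1.2.8) at the particular choice $\epsilon = n^{-1/3}$. The function names, the test function, the grid, and that choice of $\epsilon$ are all assumptions made for illustration; nothing here is claimed to be optimal.

```python
from math import comb


def bernstein(f, n, p):
    """n-th Bernstein polynomial of f at p, computed from (1.2.7)."""
    return sum(comb(n, l) * f(l / n) * p ** l * (1 - p) ** (n - l) for l in range(n + 1))


def sup_error(f, n, grid=201):
    """Largest |f(p) - B_n(p; f)| over an evenly spaced grid of points in [0, 1]."""
    points = [k / (grid - 1) for k in range(grid)]
    return max(abs(f(p) - bernstein(f, n, p)) for p in points)


def bound(n, eps, sup_norm, modulus):
    """The quantity ||f||_u / (2 n eps^2) + rho(eps; f) appearing in (1.2.8)."""
    return sup_norm / (2 * n * eps ** 2) + modulus(eps)


def f(x):
    return abs(x - 0.5)  # ||f||_u = 1/2


def rho(eps):
    return min(eps, 0.5)  # modulus of continuity of f on [0, 1]


for n in (10, 50, 200):
    print(n, sup_error(f, n), bound(n, n ** (-1 / 3), 0.5, rho))
```

Because each term of (1.2.7) is non-negative when $f \ge 0$, the computed values of $B_n(p; f)$ are non-negative, in line with the non-negativity preservation mentioned in the text.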

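In order to bring out the virtues of the Bernstein polynomials, it is important to replace (1.2.7) with an expression in which the coefficients of $B_n(\,\cdot\,; f)$ (as polynomials) are clearly displayed. To this end, introduce the difference operator $\Delta_h$ for $h > 0$ given by

$$[\Delta_h f](t) = \frac{f(t + h) - f(t)}{h}.$$

A straightforward inductive argument (using Pascal’s Identity for the binomial coefficients) shows that

$$(-h)^m\, [\Delta_h^m f](t) = \sum_{\ell=0}^{m} (-1)^\ell \binom{m}{\ell} f(t + \ell h) \quad \text{for } m \in \mathbb{Z}^+,$$

where $\Delta_h^m$ denotes the $m$th iterate of the operator $\Delta_h$. Taking $h = \tfrac1n$, we now see that

$$\begin{aligned}
B_n(p; f) &= \sum_{\ell=0}^{n} \sum_{k=0}^{n-\ell} \binom{n}{\ell} \binom{n - \ell}{k} (-1)^k f(\ell h)\, p^{\ell + k} \\
&= \sum_{r=0}^{n} p^r \sum_{\ell=0}^{r} \binom{n}{\ell} \binom{n - \ell}{r - \ell} (-1)^{r - \ell} f(\ell h) \\
&= \sum_{r=0}^{n} (-p)^r \binom{n}{r} \sum_{\ell=0}^{r} \binom{r}{\ell} (-1)^\ell f(\ell h) \\
&= \sum_{r=0}^{n} \binom{n}{r} (ph)^r\, [\Delta_h^r f](0),
\end{aligned}$$

where $\Delta_h^0 f \equiv f$. Hence, we have proved that

$$(1.2.9) \qquad B_n(p; f) = \sum_{\ell=0}^{n} n^{-\ell} \binom{n}{\ell} \bigl[\Delta_{\frac1n}^{\ell} f\bigr](0)\, p^\ell \quad \text{for } p \in [0, 1].$$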
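Since (1.2.9) is an exact algebraic identity, it can be verified with rational arithmetic. The Python sketch below compares the binomial form (1.2.7) with the difference form (1.2.9) for a cubic test polynomial; the function names and the test polynomial are hypothetical choices, and the recursive evaluation of $\Delta_h^m$ is written for clarity rather than efficiency.

```python
from fractions import Fraction
from math import comb


def forward_difference(f, h, m, t=Fraction(0)):
    """[Delta_h^m f](t), where [Delta_h f](t) = (f(t + h) - f(t)) / h and Delta_h^0 f = f."""
    if m == 0:
        return f(t)
    return (forward_difference(f, h, m - 1, t + h) - forward_difference(f, h, m - 1, t)) / h


def bernstein_binomial(f, n, p):
    """B_n(p; f) computed from (1.2.7)."""
    return sum(comb(n, l) * f(Fraction(l, n)) * p ** l * (1 - p) ** (n - l) for l in range(n + 1))


def bernstein_difference(f, n, p):
    """B_n(p; f) computed from (1.2.9), with h = 1/n and differences taken at 0."""
    h = Fraction(1, n)
    return sum(Fraction(comb(n, l), n ** l) * forward_difference(f, h, l) * p ** l for l in range(n + 1))


def cubic(x):
    return x ** 3 - 2 * x + 1


p = Fraction(2, 7)
print(bernstein_binomial(cubic, 6, p), bernstein_difference(cubic, 6, p))
```

For a polynomial of degree 3 all differences of order 4 and higher vanish, so (1.2.9) also shows that $B_n(\,\cdot\,; f)$ has degree at most 3 in that case, much as a Taylor polynomial would.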

The marked resemblance between the expression on the right-hand side of 
(1.2.9) and a Taylor polynomial is more than coincidental. To demonstrate how 



1 See G.G. Lorentz’s Bernstein Polynomials, Chelsea Publ. Co. (1986) for a lot more information. 
one can exploit the relationship between the Bernstein and Taylor polynomials, say that a function $\varphi \in C^{\infty}\bigl((a, b); \mathbb{R}\bigr)$ is absolutely monotone if its $m$th derivative $D^m\varphi$ is non-negative for every $m \in \mathbb{N}$. Also, say that $\varphi \in C^{\infty}\bigl([0, 1]; [0, 1]\bigr)$ is a probability generating function if there exists a $\{u_n : n \in \mathbb{N}\} \subseteq [0, 1]$ such that

$$\sum_{n=0}^{\infty} u_n = 1 \quad \text{and} \quad \varphi(t) = \sum_{n=0}^{\infty} u_n t^n \quad \text{for } t \in [0, 1].$$

Obviously, every probability generating function is absolutely monotone on (0, 1). 
The somewhat surprising (remember that most infinitely differentiable functions 
do not admit power series expansions) fact which I am about to prove is that, 
apart from a multiplicative constant, the converse is also true. In fact, one does 
not need to know, a priori, that the function is smooth so long as it satisfies a 
discrete version of absolute monotonicity. 

Theorem 1.2.10. Let $\varphi \in C([0, 1]; \mathbb{R})$ with $\varphi(1) = 1$ be given. Then the following are equivalent:

(i) $\varphi$ is a probability generating function;

(ii) the restriction of $\varphi$ to $(0, 1)$ is absolutely monotone;

(iii) $\bigl[\Delta_{\frac1n}^{m} \varphi\bigr](0) \ge 0$ for every $n \in \mathbb{N}$ and $0 \le m \le n$.

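PROOF: The implication (i) $\implies$ (ii) is trivial. To see that (ii) implies (iii), first observe that if $\psi$ is absolutely monotone on $(a, b)$ and $h \in (0, b - a)$, then $\Delta_h\psi$ is absolutely monotone on $(a, b - h)$. Indeed, because $D \circ \Delta_h\psi = \Delta_h \circ D\psi$ on $(a, b - h)$, we have that

$$h\bigl[D^m \circ \Delta_h\psi\bigr](t) = \int_t^{t+h} D^{m+1}\psi(s)\, ds \ge 0, \quad t \in (a, b - h),$$

for any $m \in \mathbb{N}$. Returning to the function $\varphi$, we now know that $\Delta_h^m\varphi$ is absolutely monotone on $(0, 1 - mh)$ for all $m \in \mathbb{N}$ and $h > 0$ with $mh < 1$. In particular,

$$[\Delta_h^m\varphi](0) = \lim_{t \searrow 0} [\Delta_h^m\varphi](t) \ge 0 \quad \text{if } mh < 1,$$

and so $[\Delta_h^m\varphi](0) \ge 0$ when $h = \tfrac1n$ and $0 \le m < n$. Moreover, since

$$\bigl[\Delta_{\frac1n}^{n}\varphi\bigr](0) = \lim_{h \nearrow \frac1n} [\Delta_h^n\varphi](0),$$

we also know that $[\Delta_h^n\varphi](0) \ge 0$ when $h = \tfrac1n$, and this completes the proof that (ii) implies (iii).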

Finally, assume that (iii) holds and set $\varphi_n = B_n(\,\cdot\,; \varphi)$. Then, from (1.2.9) and the equality $\varphi_n(1) = \varphi(1) = 1$, we see that each $\varphi_n$ is a probability generating function. Thus, in order to complete the proof that (iii) implies (i), all that one has to do is check that a uniform limit of probability generating functions is itself a probability generating function. To this end, write

$$\varphi_n(t) = \sum_{\ell=0}^{\infty} u_{n,\ell}\, t^\ell, \quad t \in [0, 1], \text{ for each } n \in \mathbb{Z}^+.$$

Because the $u_{n,\ell}$’s are all elements of $[0, 1]$, one can use a diagonalization procedure to choose $\{n_k : k \in \mathbb{Z}^+\}$ so that

$$\lim_{k \to \infty} u_{n_k, \ell} = u_\ell \in [0, 1]$$

exists for each $\ell \in \mathbb{N}$. But, by Lebesgue’s Dominated Convergence Theorem, this means that

$$\varphi(t) = \lim_{k \to \infty} \varphi_{n_k}(t) = \sum_{\ell=0}^{\infty} u_\ell\, t^\ell \quad \text{for every } t \in [0, 1).$$



Finally, by the Monotone Convergence Theorem, the preceding extends imme- 
diately to t = 1, and so ip is a probability generating function. (Notice that 
the argument just given does not even use the assumed uniform convergence 
and shows that the pointwise limit of probability generating functions is again 
a probability generating function.) □ 

The preceding is only one of many examples in which The Weak Law leads 
to useful ways of forming an approximate identity. A second example is given 
in Exercises 1.2.12 and 1.2.13. My treatment of these is based on that of Wm. 
Feller. 2 



Exercises for § 1.2 

Exercise 1.2.11. Although, for historical reasons, The Weak Law is usually 
thought of as a theorem about convergence in P-probability, the forms in which 
I have presented it are clearly results about convergence in either P-mean or 
even P-square mean. Thus, it is interesting to discover that one can replace the 
uniform integrability assumption made in Theorem 1.2.6 with a weak uniform in- 
tegrability assumption if one is willing to settle for convergence in P-probability. 
Namely, let $X_1, \dots, X_n, \dots$ be mutually $\mathbb{P}$-independent random variables, assume that

$$F(R) \equiv \sup_{n \in \mathbb{Z}^+} R\, \mathbb{P}\bigl(|X_n| \ge R\bigr) \longrightarrow 0 \quad \text{as } R \nearrow \infty,$$



2 Wm. Feller, An Introduction to Probability Theory and Its Applications, Vol. II, Wiley, Series 
in Probability and Math. Stat. (1968). Feller provides several other similar applications of The 
Weak Law, including the ones in the following exercises. 







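and set

$$m_n = \frac{1}{n} \sum_{\ell=1}^{n} \mathbb{E}^{\mathbb{P}}\bigl[X_\ell,\ |X_\ell| \le n\bigr], \quad n \in \mathbb{Z}^+.$$

Show that, for each $\epsilon > 0$,

$$\begin{aligned}
\mathbb{P}\bigl(|\overline{S}_n - m_n| \ge \epsilon\bigr) &\le \frac{1}{(n\epsilon)^2} \sum_{\ell=1}^{n} \mathbb{E}^{\mathbb{P}}\bigl[X_\ell^2,\ |X_\ell| \le n\bigr] + \mathbb{P}\Bigl(\max_{1 \le \ell \le n} |X_\ell| > n\Bigr) \\
&\le \frac{2}{n\epsilon^2} \int_0^n F(t)\, dt + F(n),
\end{aligned}$$

and conclude that $|\overline{S}_n - m_n| \to 0$ in $\mathbb{P}$-probability. (See part (ii) of Exercises 1.4.26 and 1.4.27 for a partial converse to this statement.)

Hint: Use the formula

$$\operatorname{Var}(Y) \le \mathbb{E}^{\mathbb{P}}\bigl[Y^2\bigr] = 2 \int_{[0, \infty)} t\, \mathbb{P}\bigl(|Y| > t\bigr)\, dt.$$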

Exercise 1.2.12. Show that, for each $T \in [0, \infty)$ and $t \in (0, \infty)$,

$$\lim_{n \to \infty} e^{-nt} \sum_{0 \le k \le nT} \frac{(nt)^k}{k!} = \begin{cases} 1 & \text{if } T > t \\ 0 & \text{if } T < t. \end{cases}$$

Hint: Let X \, . . . , X n , ... be P-independent, N-valucd Poisson random vari- 
ables with mean value t. That is, the X n 's are P-independent and 

t k 

P (X n = k)=e~ t — for A: G N. 

Show that S n is an N-valucd Poisson random variable with mean value nt , and 
conclude that, for each T G [0, 00 ) and t G (0, 00 ), 

<=■"’ E = -p( 3„ < r). 

0 <k<nT 
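
The limit in Exercise 1.2.12 can be observed numerically. The following Python sketch is an illustration only and not part of the exercise; the function name `poisson_partial_sum` is a hypothetical helper introduced here. It evaluates $e^{-nt}\sum_{0\le k\le nT}(nt)^k/k!$ term by term, working with logarithms so that $(nt)^k$ and $k!$ do not overflow.

```python
import math


def poisson_partial_sum(n: int, t: float, T: float) -> float:
    """Return e^{-nt} * sum_{0 <= k <= nT} (nt)^k / k!  (requires t > 0)."""
    lam = n * t
    k_max = math.floor(n * T)
    # Each term is exp(-lam + k*log(lam) - log(k!)); lgamma(k + 1) = log(k!).
    return sum(
        math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        for k in range(k_max + 1)
    )


if __name__ == "__main__":
    t = 1.0
    for n in (10, 100, 1000):
        # First column: T > t (values approach 1); second: T < t (values approach 0).
        print(n, poisson_partial_sum(n, t, 1.5), poisson_partial_sum(n, t, 0.5))
```

By the hint, each printed value is the probability that the average of $n$ independent Poisson variables with mean $t$ is at most $T$, so the two columns drifting toward 1 and 0 is The Weak Law at work.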



Exercise 1.2.13. Given a right-continuous function $F : [0,\infty) \longrightarrow \mathbb{R}$ of bounded
variation with $F(0) = 0$, define its Laplace transform $\varphi(\lambda)$, $\lambda \in [0,\infty)$, by
the Riemann-Stieltjes integral:

$$\varphi(\lambda) = \int_{[0,\infty)} e^{-\lambda t}\,dF(t).$$

Using Exercise 1.2.12, show that

$$\sum_{k\le nT}\frac{(-n)^k}{k!}\bigl[D^k\varphi\bigr](n) \longrightarrow F(T) \quad\text{as } n \to \infty$$

for each $T \in [0,\infty)$ at which $F$ is continuous. Conclude, in particular, that
$F$ can be recovered from its Laplace transform. Although this is not the most
practical recovery method, it is distinguished by the fact that it does not involve
complex analysis.

§ 1.3 Cramér’s Theory of Large Deviations

From Theorem 1.2.4, we know that if $\{X_n : n \in \mathbb{Z}^+\}$ is a sequence of
P-independent, P-square integrable random variables with mean value 0, and if
the averages $\overline{S}_n$, $n \in \mathbb{Z}^+$, are defined accordingly, then, for every $\epsilon > 0$,

$$\mathbb{P}\bigl(|\overline{S}_n| \ge \epsilon\bigr) \le \frac{\max_{1\le m\le n}\mathrm{Var}(X_m)}{n\epsilon^2}, \quad n \in \mathbb{Z}^+.$$

Thus, so long as

$$\frac{\mathrm{Var}(X_n)}{n} \longrightarrow 0 \quad\text{as } n \to \infty,$$

the $\overline{S}_n$’s are becoming more and more concentrated near 0, and the rate at
which this concentration is occurring can be estimated in terms of the variances
$\mathrm{Var}(X_n)$. In this section, we will see that, by placing more stringent integrability
requirements on the $X_n$’s, one can gain more information about the rate at which
the $\overline{S}_n$’s are concentrating at 0.

In all of this analysis, the trick is to see how independence can be combined 
with 0 mean value to produce unexpected cancellations; and, as a preliminary 
warm-up exercise, I begin with the following. 

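Theorem 1.3.1. Let $\{X_n : n \in \mathbb{Z}^+\}$ be a sequence of P-independent,
P-integrable random variables with mean value 0, and assume that

$$M_4 \equiv \sup_{n\in\mathbb{Z}^+} \mathbb{E}^{\mathbb{P}}\bigl[X_n^4\bigr] < \infty.$$

Then, for each $\epsilon > 0$,

$$\epsilon^4\,\mathbb{P}\bigl(|\overline{S}_n| \ge \epsilon\bigr) \le \mathbb{E}^{\mathbb{P}}\bigl[\overline{S}_n^{\,4}\bigr] \le \frac{3M_4}{n^2}, \quad n \in \mathbb{Z}^+. \tag{1.3.2}$$

In particular, $\overline{S}_n \longrightarrow 0$ P-almost surely.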



PROOF: Obviously, in order to prove (1.3.2), it suffices to check the second
inequality, which is equivalent to $\mathbb{E}^{\mathbb{P}}[S_n^4] \le 3M_4 n^2$. But

$$\mathbb{E}^{\mathbb{P}}\bigl[S_n^4\bigr] = \sum_{m_1,\dots,m_4=1}^{n} \mathbb{E}^{\mathbb{P}}\bigl[X_{m_1}\cdots X_{m_4}\bigr],$$

and, by Schwarz’s Inequality, each of these terms is dominated by $M_4$. In addition,
of these terms, the only ones that do not vanish have either all their factors
the same or two pairs of equal factors. Thus, the number of non-vanishing terms
is $n + 3n(n-1) = 3n^2 - 2n$.

Given (1.3.2), the proof of the last part becomes an easy application of the
Borel-Cantelli Lemma. Indeed, for any $\epsilon > 0$, we know from (1.3.2) that

$$\sum_{n=1}^{\infty} \mathbb{P}\bigl(|\overline{S}_n| \ge \epsilon\bigr) < \infty,$$

and therefore, by (1.1.4), that $\mathbb{P}\bigl(\varlimsup_{n\to\infty}|\overline{S}_n| \ge \epsilon\bigr) = 0$. □
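
The count of non-vanishing terms used in the proof can be confirmed by brute force for small $n$. The sketch below is an illustration rather than part of the argument, and the function name is a hypothetical helper. It enumerates all index tuples $(m_1,\dots,m_4)$ and keeps those in which no index occurs exactly once, since an index occurring exactly once contributes an independent factor with mean value 0.

```python
from collections import Counter
from itertools import product


def count_possibly_nonzero_terms(n: int) -> int:
    """Count tuples (m1, m2, m3, m4) in {1..n}^4 in which no index occurs exactly once."""
    count = 0
    for indices in product(range(1, n + 1), repeat=4):
        multiplicities = Counter(indices).values()
        # An index of multiplicity one makes E[X_{m1} X_{m2} X_{m3} X_{m4}] vanish,
        # by independence and the mean-zero assumption.
        if 1 not in multiplicities:
            count += 1
    return count


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, count_possibly_nonzero_terms(n), n + 3 * n * (n - 1))
```

The surviving tuples are exactly those with all four indices equal ($n$ of them) or with two distinct indices each occurring twice ($3n(n-1)$ of them).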
Remark 1.3.3. The final assertion in Theorem 1.3.1 is a primitive version of
The Strong Law of Large Numbers. Although The Strong Law will be taken up 
again, and considerably refined, in Section 1.4, the principle on which its proof 
here was based is an important one: namely, control more moments and you 
will get better estimates; get better estimates and you will reach more refined 
conclusions. 

With the preceding adage in mind, I will devote the rest of this section to
examining what one can say when one has all moments at one’s disposal. In fact,
from now on, I will be assuming that $X_1, \dots, X_n, \dots$ are independent random
variables with common distribution $\mu$ having the property that the moment
generating function

$$M_\mu(\xi) \equiv \int_{\mathbb{R}} e^{\xi x}\,\mu(dx) < \infty \quad\text{for all } \xi \in \mathbb{R}. \tag{1.3.4}$$

Obviously, (1.3.4) is more than sufficient to guarantee that the $X_n$’s have
moments of all orders. In fact, as an application of Lebesgue’s Dominated
Convergence Theorem, one sees that $\xi \in \mathbb{R} \longmapsto M_\mu(\xi) \in (0,\infty)$ is infinitely differentiable
and that

$$\mathbb{E}^{\mathbb{P}}\bigl[X_1^n\bigr] = \int_{\mathbb{R}} x^n\,\mu(dx) = \frac{d^n M_\mu}{d\xi^n}(0) \quad\text{for all } n \in \mathbb{N}.$$

In the discussion that follows, I will use $m$ and $\sigma^2$ to denote, respectively, the
common mean value and variance of the $X_n$’s.

In order to develop some intuition for the considerations that follow, I will
first consider an example, which, for many purposes, is the canonical example in
probability theory. Namely, let $g : \mathbb{R} \longrightarrow (0,\infty)$ be the Gauss kernel

$$g(y) \equiv \frac{1}{\sqrt{2\pi}}\exp\Bigl[-\frac{|y|^2}{2}\Bigr], \quad y \in \mathbb{R}, \tag{1.3.5}$$

and recall that a random variable $X$ is standard normal if

$$\mathbb{P}(X \in \Gamma) = \int_\Gamma g(y)\,dy, \quad \Gamma \in \mathcal{B}_{\mathbb{R}}.$$

In spite of their somewhat insultingly bland moniker, standard normal random
variables are the building blocks for the most honored family in all of probability
theory. Indeed, given $m \in \mathbb{R}$ and $\sigma \in [0,\infty)$, the random variable $Y$ is said to
be normal (or Gaussian) with mean value $m$ and variance $\sigma^2$ (often this
is abbreviated by saying that $Y$ is an $N(m,\sigma^2)$-random variable) if and only
if the distribution of $Y$ is $\gamma_{m,\sigma^2}$, where $\gamma_{m,\sigma^2}$ is the distribution of the variable
$\sigma X + m$ when $X$ is standard normal. That is, $Y$ is an $N(m,\sigma^2)$-random variable
if, when $\sigma = 0$, $\mathbb{P}(Y = m) = 1$ and, when $\sigma > 0$,

$$\mathbb{P}(Y \in \Gamma) = \int_\Gamma \frac{1}{\sigma}\,g\Bigl(\frac{y-m}{\sigma}\Bigr)\,dy \quad\text{for } \Gamma \in \mathcal{B}_{\mathbb{R}}.$$
There are two obvious reasons for the honored position held by Gaussian
random variables. In the first place, they certainly have finite moment generating
functions. In fact, since

$$\int_{\mathbb{R}} e^{\xi y} g(y)\,dy = \exp\Bigl[\frac{\xi^2}{2}\Bigr], \quad \xi \in \mathbb{R},$$

it is clear that

$$M_{\gamma_{m,\sigma^2}}(\xi) = \exp\Bigl[\xi m + \frac{\sigma^2\xi^2}{2}\Bigr], \quad \xi \in \mathbb{R}. \tag{1.3.6}$$

Secondly, they add nicely. To be precise, it is a familiar fact from elementary
probability theory that if $X$ is an $N(m,\sigma^2)$-random variable and $\hat{X}$ is
an $N(\hat{m},\hat{\sigma}^2)$-random variable that is independent of $X$, then $X + \hat{X}$ is an
$N(m+\hat{m},\,\sigma^2+\hat{\sigma}^2)$-random variable. In particular, if $X_1, \dots, X_n$ are mutually
independent, standard normal random variables, then $\overline{S}_n$ is an $N\bigl(0,\tfrac{1}{n}\bigr)$-random
variable. That is,

$$\mathbb{P}\bigl(\overline{S}_n \in \Gamma\bigr) = \sqrt{\frac{n}{2\pi}}\int_\Gamma \exp\Bigl[-\frac{n|y|^2}{2}\Bigr]\,dy.$$

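Thus (cf. Exercise 1.3.16), for any $\Gamma \in \mathcal{B}_{\mathbb{R}}$ we see that

$$\lim_{n\to\infty}\frac{1}{n}\log\Bigl[\mathbb{P}\bigl(\overline{S}_n \in \Gamma\bigr)\Bigr] = -\operatorname{ess\,inf}\Bigl\{\frac{|y|^2}{2} : y \in \Gamma\Bigr\}, \tag{1.3.7}$$

where the “ess” in (1.3.7) stands for essential and means that what follows is
taken modulo a set of measure 0. (Hence, apart from a minus sign, the right-hand
side of (1.3.7) is the greatest number dominated by $\frac{|y|^2}{2}$ for Lebesgue-almost
every $y \in \Gamma$.) In fact, because

$$\int_x^\infty g(y)\,dy \le x^{-1}g(x) \quad\text{for all } x \in (0,\infty),$$

we have the rather precise upper bound

$$\mathbb{P}\bigl(|\overline{S}_n| \ge \epsilon\bigr) \le \sqrt{\frac{2}{n\pi\epsilon^2}}\exp\Bigl[-\frac{n\epsilon^2}{2}\Bigr] \quad\text{for } \epsilon > 0.$$

At the same time, it is clear that, for $0 < \epsilon < |a|$,

$$\mathbb{P}\bigl(|\overline{S}_n - a| < \epsilon\bigr) \ge \sqrt{\frac{2\epsilon^2 n}{\pi}}\exp\Bigl[-\frac{n(|a|+\epsilon)^2}{2}\Bigr].$$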
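
The Gaussian tail estimate and the resulting upper bound for $\mathbb{P}(|\overline{S}_n| \ge \epsilon)$ are easy to test numerically. The following Python sketch is offered only as a sanity check; all function names are hypothetical helpers. It relies on the standard identity $\int_x^\infty g(y)\,dy = \tfrac12\operatorname{erfc}(x/\sqrt2)$ and on the fact, stated above, that $\overline{S}_n$ is an $N(0,\tfrac1n)$-random variable when the $X_n$’s are standard normal.

```python
import math


def gauss_kernel(y: float) -> float:
    """The Gauss kernel g(y) = exp(-y^2/2) / sqrt(2*pi)."""
    return math.exp(-y * y / 2) / math.sqrt(2 * math.pi)


def gauss_tail(x: float) -> float:
    """The integral of g over [x, infinity), expressed through erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def two_sided_tail_of_average(n: int, eps: float) -> float:
    """P(|S_bar_n| >= eps) when S_bar_n is N(0, 1/n)."""
    return 2 * gauss_tail(eps * math.sqrt(n))


def upper_bound(n: int, eps: float) -> float:
    """The bound sqrt(2 / (n*pi*eps^2)) * exp(-n*eps^2 / 2)."""
    return math.sqrt(2 / (n * math.pi * eps**2)) * math.exp(-n * eps**2 / 2)


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, two_sided_tail_of_average(n, 0.5), upper_bound(n, 0.5))
```

For large $n\epsilon^2$ the two printed columns become close in relative terms, which is the sense in which the bound is rather precise.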
More generally, if the $X_n$’s are mutually independent $N(m,\sigma^2)$-random
variables, then one finds that

$$\mathbb{P}\bigl(|\overline{S}_n - m| \ge \sigma\epsilon\bigr) \le \sqrt{\frac{2}{n\pi\epsilon^2}}\exp\Bigl[-\frac{n\epsilon^2}{2}\Bigr] \quad\text{for } \epsilon > 0;$$

and, for $0 < \epsilon < |a|$ and sufficiently large $n$’s,

$$\mathbb{P}\bigl(|\overline{S}_n - (m+a)| < \sigma\epsilon\bigr) \ge \sqrt{\frac{2\epsilon^2 n}{\pi}}\exp\Bigl[-\frac{n(|a|+\epsilon)^2}{2}\Bigr].$$



Of course, in general one cannot hope to know such explicit expressions for the
distribution of $\overline{S}_n$. Nonetheless, on the basis of the preceding, one can start to
see what is going on. Namely, when the distribution $\mu$ falls off rapidly outside of
compacts, averaging $n$ independent random variables with distribution $\mu$ has the
effect of building an exponentially deep well in which the mean value $m$ lies at the
bottom. More precisely, if one believes that the Gaussian random variables are
normal in the sense that they are typical, then one should conjecture that, even
when the random variables are not normal, the behavior of $\mathbb{P}(|\overline{S}_n - m| \ge \epsilon)$ for
large $n$’s should resemble that of Gaussians with the same variance; and it is in
the verification of this conjecture that the moment generating function $M_\mu$ plays
a central role. Namely, although an expression in terms of $\mu$ for the distribution
of $S_n$ is seldom readily available, the moment generating function for $S_n$ is easily
expressed in terms of $M_\mu$. To wit, as a trivial application of independence, we
have

$$\mathbb{E}^{\mathbb{P}}\bigl[e^{\xi S_n}\bigr] = M_\mu(\xi)^n, \quad \xi \in \mathbb{R}.$$

Hence, by Markov’s Inequality applied to $e^{\xi S_n}$, we see that, for any $a \in \mathbb{R}$,

$$\mathbb{P}\bigl(\overline{S}_n \ge a\bigr) \le e^{-n\xi a}M_\mu(\xi)^n = \exp\bigl[-n\bigl(\xi a - \Lambda_\mu(\xi)\bigr)\bigr], \quad \xi \in [0,\infty),$$

where

$$\Lambda_\mu(\xi) \equiv \log\bigl(M_\mu(\xi)\bigr) \tag{1.3.8}$$

is the logarithmic moment generating function of $\mu$. The preceding relation
is one of those lovely situations in which a single quantity is dominated by a
whole family of quantities, which means that one should optimize by minimizing
over the dominating quantities. Thus, we now have

$$\mathbb{P}\bigl(\overline{S}_n \ge a\bigr) \le \exp\Bigl[-n\sup_{\xi\in[0,\infty)}\bigl(\xi a - \Lambda_\mu(\xi)\bigr)\Bigr]. \tag{1.3.9}$$
Notice that (1.3.9) is really very good. For instance, when the $X_n$’s are
$N(m,\sigma^2)$-random variables and $\sigma > 0$, then (cf. (1.3.6)) the preceding leads quickly to the
estimate

$$\mathbb{P}\bigl(\overline{S}_n - m \ge \epsilon\bigr) \le \exp\Bigl[-\frac{n\epsilon^2}{2\sigma^2}\Bigr],$$

which is essentially the upper bound at which we arrived before.

Taking a hint from the preceding, I now introduce the Legendre transform

$$I_\mu(x) \equiv \sup\bigl\{\xi x - \Lambda_\mu(\xi) : \xi \in \mathbb{R}\bigr\}, \quad x \in \mathbb{R}, \tag{1.3.10}$$

of $\Lambda_\mu$ and, before proceeding further, make some elementary observations about
the structure of the functions $\Lambda_\mu$ and $I_\mu$.
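
To get a feel for the Legendre transform in (1.3.10), one can evaluate it numerically in the Gaussian case, where (1.3.6) gives $\Lambda_\mu(\xi) = \xi m + \sigma^2\xi^2/2$ and the supremum works out to $(x-m)^2/(2\sigma^2)$ (cf. Exercise 1.3.18). The Python sketch below is only an illustration under stated assumptions: the names are hypothetical, the search is restricted to a bounded interval $[-50, 50]$ that is assumed to contain the maximizer, and ternary search is used because $\xi \mapsto \xi x - \Lambda_\mu(\xi)$ is concave.

```python
import math


def legendre_transform(log_mgf, x, lo=-50.0, hi=50.0, iterations=200):
    """Approximate sup of xi*x - log_mgf(xi) over xi in [lo, hi] by ternary search.

    Assumes the objective is concave and that its maximizer lies in [lo, hi].
    """
    def objective(xi):
        return xi * x - log_mgf(xi)

    for _ in range(iterations):
        left = lo + (hi - lo) / 3
        right = hi - (hi - lo) / 3
        if objective(left) < objective(right):
            lo = left   # the maximizer cannot lie to the left of `left`
        else:
            hi = right  # the maximizer cannot lie to the right of `right`
    return objective((lo + hi) / 2)


def gaussian_log_mgf(m, sigma):
    """Lambda(xi) = xi*m + sigma^2 * xi^2 / 2 for the N(m, sigma^2) distribution."""
    return lambda xi: xi * m + 0.5 * (sigma * xi) ** 2


if __name__ == "__main__":
    m, sigma = 1.0, 2.0
    for x in (-3.0, 0.0, 1.0, 4.0):
        numeric = legendre_transform(gaussian_log_mgf(m, sigma), x)
        print(x, numeric, (x - m) ** 2 / (2 * sigma**2))
```

The two printed columns agree, and the value at $x = m$ is 0, in line with the statement in Lemma 1.3.11 that $I_\mu$ vanishes at $m$.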

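Lemma 1.3.11. The function $\Lambda_\mu$ is infinitely differentiable. In addition, for
each $\xi \in \mathbb{R}$, the probability measure $\nu_\xi$ on $\mathbb{R}$ given by

$$\nu_\xi(\Gamma) = \frac{1}{M_\mu(\xi)}\int_\Gamma e^{\xi x}\,\mu(dx) \quad\text{for } \Gamma \in \mathcal{B}_{\mathbb{R}}$$

has moments of all orders,

$$\int_{\mathbb{R}} x\,\nu_\xi(dx) = \Lambda_\mu'(\xi), \quad\text{and}\quad \int_{\mathbb{R}} x^2\,\nu_\xi(dx) - \Bigl(\int_{\mathbb{R}} x\,\nu_\xi(dx)\Bigr)^2 = \Lambda_\mu''(\xi).$$

Next, the function $I_\mu$ is a $[0,\infty]$-valued, lower semicontinuous, convex function
that vanishes at $m$. Moreover,

$$I_\mu(x) = \sup\bigl\{\xi x - \Lambda_\mu(\xi) : \xi \ge 0\bigr\} \quad\text{for } x \in [m,\infty)$$

and

$$I_\mu(x) = \sup\bigl\{\xi x - \Lambda_\mu(\xi) : \xi \le 0\bigr\} \quad\text{for } x \in (-\infty,m].$$

Finally, if

$$\alpha = \inf\bigl\{x \in \mathbb{R} : \mu\bigl((-\infty,x]\bigr) > 0\bigr\} \quad\text{and}\quad \beta = \sup\bigl\{x \in \mathbb{R} : \mu\bigl([x,\infty)\bigr) > 0\bigr\},$$

then $I_\mu$ is smooth on $(\alpha,\beta)$ and identically $+\infty$ off of $[\alpha,\beta]$. In fact, either
$\mu(\{m\}) = 1$ and $\alpha = m = \beta$ or $m \in (\alpha,\beta)$, in which case $\Lambda_\mu'$ is a smooth,
strictly increasing mapping from $\mathbb{R}$ onto $(\alpha,\beta)$,

$$I_\mu(x) = \Xi_\mu(x)\,x - \Lambda_\mu\bigl(\Xi_\mu(x)\bigr), \quad x \in (\alpha,\beta), \quad\text{where } \Xi_\mu = (\Lambda_\mu')^{-1}$$

is the inverse of $\Lambda_\mu'$, $\mu(\{\alpha\}) = e^{-I_\mu(\alpha)}$ if $\alpha > -\infty$, and $\mu(\{\beta\}) = e^{-I_\mu(\beta)}$ if
$\beta < \infty$.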
PROOF: For notational convenience, I will drop the subscript “$\mu$” during the
proof. Further, note that the smoothness of $\Lambda$ follows immediately from the
positivity and smoothness of $M$, and the identification of $\Lambda'(\xi)$ and $\Lambda''(\xi)$ with
the mean and variance of $\nu_\xi$ is elementary calculus combined with the remark
following (1.3.4). Thus, I will concentrate on the properties of the function $I$.

As the pointwise supremum of functions that are linear, $I$ is certainly lower
semicontinuous and convex. Also, because $\Lambda(0) = 0$, it is obvious that $I \ge 0$.
Next, by Jensen’s Inequality,

$$\Lambda(\xi) \ge \xi\int_{\mathbb{R}} x\,\mu(dx) = \xi m,$$

and, therefore, $\xi x - \Lambda(\xi) \le 0$ if $x \le m$ and $\xi \ge 0$ or if $x \ge m$ and $\xi \le 0$. Hence,
because $I$ is non-negative, this proves the one-sided extremal characterizations
of $I_\mu(x)$ depending on whether $x \ge m$ or $x \le m$.

Turning to the final part, note first that there is nothing more to do in the
case when $\mu(\{m\}) = 1$. Thus, assume that $\mu(\{m\}) < 1$, in which case it is clear
that $m \in (\alpha,\beta)$ and that none of the measures $\nu_\xi$ is degenerate (i.e., concentrated
at one point). In particular, because $\Lambda''(\xi)$ is the variance of the $\nu_\xi$, we know
that $\Lambda'' > 0$ everywhere. Hence, $\Lambda'$ is strictly increasing and therefore admits a
smooth inverse $\Xi$ on its image. Furthermore, because $\Lambda'(\xi)$ is the mean of $\nu_\xi$, it
is clear that the image of $\Lambda'$ is contained in $(\alpha,\beta)$. At the same time, given an
$x \in (\alpha,\beta)$, note that

$$e^{-\xi x}\int_{\mathbb{R}} e^{\xi y}\,\mu(dy) \longrightarrow \infty \quad\text{as } |\xi| \to \infty,$$

and therefore $\xi \longmapsto \xi x - \Lambda(\xi)$ achieves a maximum at some point $\xi_x \in \mathbb{R}$. In
addition, by the first derivative test, $\Lambda'(\xi_x) = x$, and so $\xi_x = \Xi(x)$. Finally,
suppose that $\beta < \infty$. Then

$$e^{-\xi\beta}\int_{\mathbb{R}} e^{\xi y}\,\mu(dy) = \int_{(-\infty,\beta]} e^{-\xi(\beta-y)}\,\mu(dy) \searrow \mu(\{\beta\}) \quad\text{as } \xi \to \infty,$$

and therefore $e^{-I(\beta)} = \inf_{\xi\ge0} e^{-\xi\beta}M(\xi) = \mu(\{\beta\})$. Since the same reasoning
applies when $\alpha > -\infty$, we are done. □

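Theorem 1.3.12 (Cramér’s Theorem). Let $\{X_n : n \ge 1\}$ be a sequence of
P-independent random variables with common distribution $\mu$, assume that the
associated moment generating function $M_\mu$ satisfies (1.3.4), set $m = \int_{\mathbb{R}} x\,\mu(dx)$,
and define $I_\mu$ accordingly, as in (1.3.10). Then,

$$\mathbb{P}\bigl(\overline{S}_n \ge a\bigr) \le e^{-nI_\mu(a)} \quad\text{for all } a \in [m,\infty),$$
$$\mathbb{P}\bigl(\overline{S}_n \le a\bigr) \le e^{-nI_\mu(a)} \quad\text{for all } a \in (-\infty,m].$$

Moreover, for $a \in (\alpha,\beta)$ (cf. Lemma 1.3.11), $\epsilon > 0$, and $n \in \mathbb{Z}^+$,

$$\mathbb{P}\bigl(|\overline{S}_n - a| < \epsilon\bigr) \ge \Bigl(1 - \frac{\Lambda_\mu''\bigl(\Xi_\mu(a)\bigr)}{n\epsilon^2}\Bigr)\exp\Bigl[-n\bigl(I_\mu(a) + \epsilon|\Xi_\mu(a)|\bigr)\Bigr],$$

where $\Lambda_\mu$ is the function given in (1.3.8) and $\Xi_\mu = (\Lambda_\mu')^{-1}$.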

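PROOF: To prove the first part, suppose that $a \in [m,\infty)$, and apply the second
part of Lemma 1.3.11 to see that the exponent in (1.3.9) equals $-nI_\mu(a)$, and,
after replacing $\{X_n : n \ge 1\}$ by $\{-X_n : n \ge 1\}$, one also gets the desired
estimate when $a \le m$.

To prove the lower bound, let $a \in [m,\beta)$ be given, and set $\xi = \Xi_\mu(a) \in
[0,\infty)$. Next, recall the probability measure $\nu_\xi$ described in Lemma 1.3.11, and
remember that $\nu_\xi$ has mean value $a = \Lambda_\mu'(\xi)$ and variance $\Lambda_\mu''(\xi)$. Further, if
$\{Y_n : n \in \mathbb{Z}^+\}$ is a sequence of independent, identically distributed random
variables with common distribution $\nu_\xi$, then it is an easy matter to check that,
for any $n \in \mathbb{Z}^+$ and every $\mathcal{B}_{\mathbb{R}^n}$-measurable $F : \mathbb{R}^n \longrightarrow [0,\infty)$,

$$\mathbb{E}^{\mathbb{P}}\bigl[F(Y_1,\dots,Y_n)\bigr] = \frac{1}{M_\mu(\xi)^n}\,\mathbb{E}^{\mathbb{P}}\bigl[e^{\xi S_n}F(X_1,\dots,X_n)\bigr].$$

In particular, if

$$T_n = \sum_{\ell=1}^{n} Y_\ell \quad\text{and}\quad \overline{T}_n = \frac{T_n}{n},$$

then, because $I_\mu(a) = \xi a - \Lambda_\mu(\xi)$,

$$\begin{aligned}
\mathbb{P}\bigl(|\overline{S}_n - a| < \epsilon\bigr) &= M_\mu(\xi)^n\,\mathbb{E}^{\mathbb{P}}\bigl[e^{-\xi T_n},\ |\overline{T}_n - a| < \epsilon\bigr] \\
&\ge e^{-n\xi(a+\epsilon)}M_\mu(\xi)^n\,\mathbb{P}\bigl(|\overline{T}_n - a| < \epsilon\bigr) \\
&= \exp\bigl[-n\bigl(I_\mu(a) + \xi\epsilon\bigr)\bigr]\,\mathbb{P}\bigl(|\overline{T}_n - a| < \epsilon\bigr).
\end{aligned}$$

But, because the mean value and variance of the $Y_n$’s are, respectively, $a$ and
$\Lambda_\mu''(\xi)$, (1.2.5) leads to

$$\mathbb{P}\bigl(|\overline{T}_n - a| \ge \epsilon\bigr) \le \frac{\Lambda_\mu''(\xi)}{n\epsilon^2}.$$

The case when $a \in (\alpha,m]$ is handled in the same way. □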
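
For a concrete look at the upper bound in Theorem 1.3.12, take $\mu$ to be the distribution on $\{0,1\}$ that assigns mass $q$ to 1, so that $m = q$ and, by the formula in Exercise 1.3.18 (with $a = 0$, $b = 1$, $p = 1-q$), $I_\mu(x) = x\log\frac{x}{q} + (1-x)\log\frac{1-x}{1-q}$ for $x \in (0,1)$. The Python sketch below is an illustration only, with hypothetical function names; it compares the exact binomial tail with $e^{-nI_\mu(a)}$.

```python
import math


def bernoulli_rate(x: float, q: float) -> float:
    """I(x) for the law on {0, 1} with mass q at 1, valid for 0 < x < 1."""
    return x * math.log(x / q) + (1 - x) * math.log((1 - x) / (1 - q))


def upper_tail(n: int, q: float, a: float) -> float:
    """Exact P(S_bar_n >= a) for n independent variables with the law above."""
    k_min = math.ceil(n * a)
    return sum(
        math.comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(k_min, n + 1)
    )


if __name__ == "__main__":
    q, a = 0.3, 0.5
    for n in (20, 50, 200):
        exact = upper_tail(n, q, a)
        bound = math.exp(-n * bernoulli_rate(a, q))
        # -log(exact)/n should approach the rate I(a) from above as n grows.
        print(n, exact, bound, -math.log(exact) / n, bernoulli_rate(a, q))
```

The exact tail never exceeds the bound, and its exponential decay rate approaches $I_\mu(a)$, which is what the lower bound in the theorem leads one to expect.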



Results like the ones obtained in Theorem 1.3.12 are examples of a class of 
results known as large deviations estimates. They are large deviations be- 
cause the probability of their occurrence is exponentially small. Although large 
deviation estimates are available in a variety of circumstances, 1 in general one 
has to settle for the cruder sort of information contained in the following. 

1 In fact, some people have written entire books on the subject. See, for example, J.-D. 
Deuschel and D. Stroock, Large Deviations, now available from the A.M.S. in the Chelsea 
Series. 
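Corollary 1.3.13. For any $\Gamma \in \mathcal{B}_{\mathbb{R}}$,

$$-\inf_{x\in\Gamma^\circ} I_\mu(x) \le \varliminf_{n\to\infty}\frac{1}{n}\log\Bigl[\mathbb{P}\bigl(\overline{S}_n \in \Gamma\bigr)\Bigr] \le \varlimsup_{n\to\infty}\frac{1}{n}\log\Bigl[\mathbb{P}\bigl(\overline{S}_n \in \Gamma\bigr)\Bigr] \le -\inf_{x\in\overline{\Gamma}} I_\mu(x).$$

(I use $\Gamma^\circ$ and $\overline{\Gamma}$ to denote the interior and closure of a set $\Gamma$. Also, recall that I
take the infimum over the empty set to be $+\infty$.)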

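PROOF: To prove the upper bound, let $\Gamma$ be a closed set, and define $\Gamma_+ =
\Gamma \cap [m,\infty)$ and $\Gamma_- = \Gamma \cap (-\infty,m]$. Clearly,

$$\mathbb{P}\bigl(\overline{S}_n \in \Gamma\bigr) \le 2\,\mathbb{P}\bigl(\overline{S}_n \in \Gamma_+\bigr) \vee \mathbb{P}\bigl(\overline{S}_n \in \Gamma_-\bigr).$$

Moreover, if $\Gamma_+ \ne \emptyset$ and $a_+ = \min\{x : x \in \Gamma_+\}$, then, by Lemma 1.3.11 and
Theorem 1.3.12,

$$I_\mu(a_+) = \inf\bigl\{I_\mu(x) : x \in \Gamma_+\bigr\} \quad\text{and}\quad \mathbb{P}\bigl(\overline{S}_n \in \Gamma_+\bigr) \le e^{-nI_\mu(a_+)}.$$

Similarly, if $\Gamma_- \ne \emptyset$ and $a_- = \max\{x : x \in \Gamma_-\}$, then

$$I_\mu(a_-) = \inf\bigl\{I_\mu(x) : x \in \Gamma_-\bigr\} \quad\text{and}\quad \mathbb{P}\bigl(\overline{S}_n \in \Gamma_-\bigr) \le e^{-nI_\mu(a_-)}.$$

Hence, either $\Gamma = \emptyset$, and there is nothing to do anyhow, or

$$\mathbb{P}\bigl(\overline{S}_n \in \Gamma\bigr) \le 2\exp\bigl[-n\inf\{I_\mu(x) : x \in \Gamma\}\bigr], \quad n \in \mathbb{Z}^+,$$

which certainly implies the asserted upper bound.

To prove the lower bound, assume that $\Gamma$ is a non-empty open set. What I
have to show is that

$$\varliminf_{n\to\infty}\frac{1}{n}\log\Bigl[\mathbb{P}\bigl(\overline{S}_n \in \Gamma\bigr)\Bigr] \ge -I_\mu(a)$$

for every $a \in \Gamma$. If $a \in \Gamma \cap (\alpha,\beta)$, choose $\delta > 0$ so that $(a-\delta,a+\delta) \subseteq \Gamma$ and
use the second part of Theorem 1.3.12 to see that

$$\varliminf_{n\to\infty}\frac{1}{n}\log\Bigl[\mathbb{P}\bigl(\overline{S}_n \in \Gamma\bigr)\Bigr] \ge -I_\mu(a) - \epsilon|\Xi_\mu(a)|$$

for every $\epsilon \in (0,\delta)$. If $a \notin [\alpha,\beta]$, then $I_\mu(a) = \infty$, and so there is nothing to do.
Finally, if $a \in \{\alpha,\beta\}$, then $\mu(\{a\}) = e^{-I_\mu(a)}$ and therefore

$$\mathbb{P}\bigl(\overline{S}_n \in \Gamma\bigr) \ge \mathbb{P}\bigl(\overline{S}_n = a\bigr) \ge e^{-nI_\mu(a)}. \qquad\Box$$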
Remark 1.3.14. The upper bound in Theorem 1.3.12 is often called Chernoff’s
Inequality. The idea underlying its derivation is rather mundane by
comparison to the subtle idea underlying the proof of the lower bound. Indeed,
it may not be immediately obvious what that idea was! Thus, consider once
again the second part of the proof of Theorem 1.3.12. What I had to do is estimate
the probability that $\overline{S}_n$ lies in a neighborhood of $a$. When $a$ is the mean
value $m$, such an estimate is provided by the Weak Law. On the other hand,
when $a \ne m$, the Weak Law for the $X_n$’s has very little to contribute. Thus,
what I did is replace the original $X_n$’s by random variables $Y_n$, $n \in \mathbb{Z}^+$, whose
mean value is $a$. Furthermore, the transformation from the $X_n$’s to the $Y_n$’s was
sufficiently simple that it was easy to estimate $X_n$-probabilities in terms of
$Y_n$-probabilities. Finally, the Weak Law applied to the $Y_n$’s gave strong information
about the rate of approach of $\frac{1}{n}\sum_{\ell=1}^{n} Y_\ell$ to $a$.
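
The change of measure described in this remark can be made explicit in the simplest case. In the Python sketch below, which is an illustration with hypothetical function names, $\mu$ is the distribution on $\{0,1\}$ with mass $q$ at 1, so that $M_\mu(\xi) = 1 - q + qe^{\xi}$ and the measure $\nu_\xi$ of Lemma 1.3.11 assigns mass $qe^{\xi}/M_\mu(\xi)$ to 1. Solving $\Lambda_\mu'(\xi) = a$ gives $\Xi_\mu(a) = \log\frac{a(1-q)}{q(1-a)}$, and the code checks that the tilted mean is $a$ and that $I_\mu(a) = \xi a - \Lambda_\mu(\xi)$ at $\xi = \Xi_\mu(a)$.

```python
import math


def log_mgf(xi: float, q: float) -> float:
    """Lambda(xi) = log(1 - q + q*e^xi) for the law on {0, 1} with mass q at 1."""
    return math.log(1 - q + q * math.exp(xi))


def tilt_parameter(a: float, q: float) -> float:
    """Xi(a): the solution xi of Lambda'(xi) = a, for 0 < a < 1."""
    return math.log(a * (1 - q) / (q * (1 - a)))


def tilted_mass_at_one(xi: float, q: float) -> float:
    """nu_xi({1}) = q*e^xi / M(xi); this is also the mean value of nu_xi."""
    weight = q * math.exp(xi)
    return weight / (1 - q + weight)


def rate(x: float, q: float) -> float:
    """Closed form of I(x) for this law, valid for 0 < x < 1."""
    return x * math.log(x / q) + (1 - x) * math.log((1 - x) / (1 - q))


if __name__ == "__main__":
    q, a = 0.3, 0.5
    xi = tilt_parameter(a, q)
    print("xi =", xi)
    print("tilted mean =", tilted_mass_at_one(xi, q))
    print("xi*a - Lambda(xi) =", xi * a - log_mgf(xi, q), " I(a) =", rate(a, q))
```

Because $a \ge m = q$ here, the tilt parameter is non-negative, in agreement with the choice $\xi = \Xi_\mu(a) \in [0,\infty)$ made in the proof of the lower bound.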

I close this section by verifying the conjecture (cf. the discussion preceding 
Lemma 1.3.11) that the Gaussian case is normal. In particular, I want to check 
that the well around m in which the distribution of S n becomes concentrated 
looks Gaussian, and, in view of Theorem 1.3.12, this comes down to the following. 

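Theorem 1.3.15. Let everything be as in Lemma 1.3.11, and assume that
the variance $\sigma^2 > 0$. There exists a $\delta \in (0,1]$ and a $K \in (0,\infty)$ such that
$[m-\delta,m+\delta] \subseteq (\alpha,\beta)$ (cf. Lemma 1.3.11), $\bigl|\Lambda_\mu''\bigl(\Xi_\mu(x)\bigr)\bigr| \le K$,
$|\Xi_\mu(x)| \le K|x-m|$, and

$$\Bigl|I_\mu(x) - \frac{(x-m)^2}{2\sigma^2}\Bigr| \le K|x-m|^3$$

for all $x \in [m-\delta,m+\delta]$. In particular, if $0 < \epsilon < \delta$, then

$$\mathbb{P}\bigl(|\overline{S}_n - m| \ge \epsilon\bigr) \le 2\exp\Bigl[-n\Bigl(\frac{\epsilon^2}{2\sigma^2} - K\epsilon^3\Bigr)\Bigr],$$

and if $|a-m| < \delta$ and $\epsilon > 0$, then

$$\mathbb{P}\bigl(|\overline{S}_n - a| < \epsilon\bigr) \ge \Bigl(1 - \frac{K}{n\epsilon^2}\Bigr)\exp\Bigl[-n\Bigl(\frac{|a-m|^2}{2\sigma^2} + K|a-m|\bigl(\epsilon + |a-m|^2\bigr)\Bigr)\Bigr].$$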

PROOF: Without loss in generality (cf. Exercise 1.3.17), I will assume that $m =
0$ and $\sigma^2 = 1$. Since, in this case, $\Lambda_\mu(0) = \Lambda_\mu'(0) = 0$ and $\Lambda_\mu''(0) = 1$, it
follows that $\Xi_\mu(0) = 0$ and $\Xi_\mu'(0) = 1$. Hence, we can find an $M \in (0,\infty)$
and a $\delta \in (0,1]$ with $\alpha < -\delta < \delta < \beta$ for which $|\Xi_\mu(x) - x| \le M|x|^2$ and
$\bigl|\Lambda_\mu(\xi) - \frac{\xi^2}{2}\bigr| \le M|\xi|^3$ whenever $|x| \le \delta$ and $|\xi| \le (M+1)\delta$, respectively. In
particular, this leads immediately to $|\Xi_\mu(x)| \le (M+1)|x|$ for $|x| \le \delta$, and
the estimate for $I_\mu$ comes easily from the preceding combined with equation
$I_\mu(x) = \Xi_\mu(x)\,x - \Lambda_\mu\bigl(\Xi_\mu(x)\bigr)$. □
Exercises for § 1.3

Exercise 1.3.16. Let $(E,\mathcal{F},\mu)$ be a measure space and $f$ a non-negative,
$\mathcal{F}$-measurable function. If either $\mu(E) < \infty$ or $f$ is $\mu$-integrable, show that

$$\|f\|_{L^p(\mu;\mathbb{R})} \longrightarrow \|f\|_{L^\infty(\mu;\mathbb{R})} \quad\text{as } p \to \infty.$$

Hint: Handle the case $\mu(E) < \infty$ first, and treat the case when $f \in L^1(\mu;\mathbb{R})$
by considering the measure $\nu(dx) = f(x)\,\mu(dx)$.

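Exercise 1.3.17. Referring to the notation used in this section, assume that
$\mu$ is a non-degenerate (i.e., it is not concentrated at a single point) probability
measure on $\mathbb{R}$ for which (1.3.4) holds. Next, let $m$ and $\sigma^2$ be the mean and
variance of $\mu$, use $\nu$ to denote the distribution of

$$x \in \mathbb{R} \longmapsto \frac{x-m}{\sigma} \in \mathbb{R} \quad\text{under } \mu,$$

and define $\Lambda_\nu$, $I_\nu$, and $\Xi_\nu$ accordingly. Show that

$$\Lambda_\mu(\xi) = \xi m + \Lambda_\nu(\sigma\xi), \quad \xi \in \mathbb{R},$$
$$I_\mu(x) = I_\nu\Bigl(\frac{x-m}{\sigma}\Bigr), \quad x \in \mathbb{R},$$
$$\mathrm{Image}(\Lambda_\mu') = m + \sigma\,\mathrm{Image}(\Lambda_\nu'),$$
$$\Xi_\mu(x) = \frac{1}{\sigma}\,\Xi_\nu\Bigl(\frac{x-m}{\sigma}\Bigr), \quad x \in \mathrm{Image}(\Lambda_\mu').$$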



Exercise 1.3.18. Continue with the same notation as in the preceding.

(i) Show that $I_\nu \le I_\mu$ if $M_\mu \le M_\nu$.

(ii) Show that

$$I_\mu(x) = \frac{(x-m)^2}{2\sigma^2}, \quad x \in \mathbb{R},$$

when $\mu$ is the $N(m,\sigma^2)$ distribution with $\sigma > 0$, and show that

$$I_\mu(x) = \frac{x-a}{b-a}\log\frac{x-a}{(1-p)(b-a)} + \frac{b-x}{b-a}\log\frac{b-x}{p(b-a)}, \quad x \in (a,b),$$

when $a < b$, $p \in (0,1)$, and $\mu(\{a\}) = 1 - \mu(\{b\}) = p$.

(iii) When $\mu$ is the centered Bernoulli distribution given by $\mu(\{\pm1\}) = \frac12$, show
that $M_\mu(\xi) \le \exp\bigl[\frac{\xi^2}{2}\bigr]$, $\xi \in \mathbb{R}$, and conclude that $I_\mu(x) \ge \frac{x^2}{2}$, $x \in \mathbb{R}$. More
generally, given $n \in \mathbb{Z}^+$, $\{\sigma_k : 1 \le k \le n\} \subseteq \mathbb{R}$, and independent random
variables $X_1, \dots, X_n$ with this $\mu$ as their common distribution, let $\nu$ denote the




